# Activity 07 - Linear models

***
##### CS 434 - Data Mining and Machine Learning
##### Oregon State University-Cascades
***

# Load packages

In [0]:
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

# Dataset

[Communities and Crime Data Set](http://archive.ics.uci.edu/ml/datasets/communities+and+crime)

Predict *Violent Crimes Per Population* for communities within the United States. The data combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR.

**Note**: this data is already normalized (numeric attributes are scaled to the range 0.00 to 1.00).


### Attributes

* **state:** US state (by number) - not counted as predictive above, but if considered, should be considered nominal (nominal)
* **county:** numeric code for county - not predictive, and many missing values (numeric)
* **community:** numeric code for community - not predictive and many missing values (numeric)
* **communityname:** community name - not predictive - for information only (string)
* **fold:** fold number for non-random 10 fold cross validation, potentially useful for debugging, paired tests - not predictive (numeric)
* **population:** population for community (numeric - decimal)
* **householdsize:** mean people per household (numeric - decimal)
* **racepctblack:** percentage of population that is african american (numeric - decimal)
* **racePctWhite:** percentage of population that is caucasian (numeric - decimal)
* **racePctAsian:** percentage of population that is of asian heritage (numeric - decimal)
* **racePctHisp:** percentage of population that is of hispanic heritage (numeric - decimal)
* **agePct12t21:** percentage of population that is 12-21 in age (numeric - decimal)
* **agePct12t29:** percentage of population that is 12-29 in age (numeric - decimal)
* **agePct16t24:** percentage of population that is 16-24 in age (numeric - decimal)
* **agePct65up:** percentage of population that is 65 and over in age (numeric - decimal)
* **numbUrban:** number of people living in areas classified as urban (numeric - decimal)
* **pctUrban:** percentage of people living in areas classified as urban (numeric - decimal)
* **medIncome:** median household income (numeric - decimal)
* **pctWWage:** percentage of households with wage or salary income in 1989 (numeric - decimal)
* **pctWFarmSelf:** percentage of households with farm or self employment income in 1989 (numeric - decimal)
* **pctWInvInc:** percentage of households with investment / rent income in 1989 (numeric - decimal)
* **pctWSocSec:** percentage of households with social security income in 1989 (numeric - decimal)
* **pctWPubAsst:** percentage of households with public assistance income in 1989 (numeric - decimal)
* **pctWRetire:** percentage of households with retirement income in 1989 (numeric - decimal)
* **medFamInc:** median family income (differs from household income for non-family households) (numeric - decimal)
* **perCapInc:** per capita income (numeric - decimal)
* **whitePerCap:** per capita income for caucasians (numeric - decimal)
* **blackPerCap:** per capita income for african americans (numeric - decimal)
* **indianPerCap:** per capita income for native americans (numeric - decimal)
* **AsianPerCap:** per capita income for people with asian heritage (numeric - decimal)
* **OtherPerCap:** per capita income for people with 'other' heritage (numeric - decimal)
* **HispPerCap:** per capita income for people with hispanic heritage (numeric - decimal)
* **NumUnderPov:** number of people under the poverty level (numeric - decimal)
* **PctPopUnderPov:** percentage of people under the poverty level (numeric - decimal)
* **PctLess9thGrade:** percentage of people 25 and over with less than a 9th grade education (numeric - decimal)
* **PctNotHSGrad:** percentage of people 25 and over that are not high school graduates (numeric - decimal)
* **PctBSorMore:** percentage of people 25 and over with a bachelors degree or higher education (numeric - decimal)
* **PctUnemployed:** percentage of people 16 and over, in the labor force, and unemployed (numeric - decimal)
* **PctEmploy:** percentage of people 16 and over who are employed (numeric - decimal)
* **PctEmplManu:** percentage of people 16 and over who are employed in manufacturing (numeric - decimal)
* **PctEmplProfServ:** percentage of people 16 and over who are employed in professional services (numeric - decimal)
* **PctOccupManu:** percentage of people 16 and over who are employed in manufacturing (numeric - decimal)
* **PctOccupMgmtProf:** percentage of people 16 and over who are employed in management or professional occupations (numeric - decimal)
* **MalePctDivorce:** percentage of males who are divorced (numeric - decimal)
* **MalePctNevMarr:** percentage of males who have never married (numeric - decimal)
* **FemalePctDiv:** percentage of females who are divorced (numeric - decimal)
* **TotalPctDiv:** percentage of population who are divorced (numeric - decimal)
* **PersPerFam:** mean number of people per family (numeric - decimal)
* **PctFam2Par:** percentage of families (with kids) that are headed by two parents (numeric - decimal)
* **PctKids2Par:** percentage of kids in family housing with two parents (numeric - decimal)
* **PctYoungKids2Par:** percent of kids 4 and under in two parent households (numeric - decimal)
* **PctTeen2Par:** percent of kids age 12-17 in two parent households (numeric - decimal)
* **PctWorkMomYoungKids:** percentage of moms of kids 6 and under in labor force (numeric - decimal)
* **PctWorkMom:** percentage of moms of kids under 18 in labor force (numeric - decimal)
* **NumIlleg:** number of kids born to never married (numeric - decimal)
* **PctIlleg:** percentage of kids born to never married (numeric - decimal)
* **NumImmig:** total number of people known to be foreign born (numeric - decimal)
* **PctImmigRecent:** percentage of _immigrants_ who immigrated within last 3 years (numeric - decimal)
* **PctImmigRec5:** percentage of _immigrants_ who immigrated within last 5 years (numeric - decimal)
* **PctImmigRec8:** percentage of _immigrants_ who immigrated within last 8 years (numeric - decimal)
* **PctImmigRec10:** percentage of _immigrants_ who immigrated within last 10 years (numeric - decimal)
* **PctRecentImmig:** percent of _population_ who have immigrated within the last 3 years (numeric - decimal)
* **PctRecImmig5:** percent of _population_ who have immigrated within the last 5 years (numeric - decimal)
* **PctRecImmig8:** percent of _population_ who have immigrated within the last 8 years (numeric - decimal)
* **PctRecImmig10:** percent of _population_ who have immigrated within the last 10 years (numeric - decimal)
* **PctSpeakEnglOnly:** percent of people who speak only English (numeric - decimal)
* **PctNotSpeakEnglWell:** percent of people who do not speak English well (numeric - decimal)
* **PctLargHouseFam:** percent of family households that are large (6 or more) (numeric - decimal)
* **PctLargHouseOccup:** percent of all occupied households that are large (6 or more people) (numeric - decimal)
* **PersPerOccupHous:** mean persons per household (numeric - decimal)
* **PersPerOwnOccHous:** mean persons per owner occupied household (numeric - decimal)
* **PersPerRentOccHous:** mean persons per rental household (numeric - decimal)
* **PctPersOwnOccup:** percent of people in owner occupied households (numeric - decimal)
* **PctPersDenseHous:** percent of persons in dense housing (more than 1 person per room) (numeric - decimal)
* **PctHousLess3BR:** percent of housing units with less than 3 bedrooms (numeric - decimal)
* **MedNumBR:** median number of bedrooms (numeric - decimal)
* **HousVacant:** number of vacant households (numeric - decimal)
* **PctHousOccup:** percent of housing occupied (numeric - decimal)
* **PctHousOwnOcc:** percent of households owner occupied (numeric - decimal)
* **PctVacantBoarded:** percent of vacant housing that is boarded up (numeric - decimal)
* **PctVacMore6Mos:** percent of vacant housing that has been vacant more than 6 months (numeric - decimal)
* **MedYrHousBuilt:** median year housing units built (numeric - decimal)
* **PctHousNoPhone:** percent of occupied housing units without phone (in 1990, this was rare!) (numeric - decimal)
* **PctWOFullPlumb:** percent of housing without complete plumbing facilities (numeric - decimal)
* **OwnOccLowQuart:** owner occupied housing - lower quartile value (numeric - decimal)
* **OwnOccMedVal:** owner occupied housing - median value (numeric - decimal)
* **OwnOccHiQuart:** owner occupied housing - upper quartile value (numeric - decimal)
* **RentLowQ:** rental housing - lower quartile rent (numeric - decimal)
* **RentMedian:** rental housing - median rent (Census variable H32B from file STF1A) (numeric - decimal)
* **RentHighQ:** rental housing - upper quartile rent (numeric - decimal)
* **MedRent:** median gross rent (Census variable H43A from file STF3A - includes utilities) (numeric - decimal)
* **MedRentPctHousInc:** median gross rent as a percentage of household income (numeric - decimal)
* **MedOwnCostPctInc:** median owners cost as a percentage of household income - for owners with a mortgage (numeric - decimal)
* **MedOwnCostPctIncNoMtg:** median owners cost as a percentage of household income - for owners without a mortgage (numeric - decimal)
* **NumInShelters:** number of people in homeless shelters (numeric - decimal)
* **NumStreet:** number of homeless people counted in the street (numeric - decimal)
* **PctForeignBorn:** percent of people foreign born (numeric - decimal)
* **PctBornSameState:** percent of people born in the same state as currently living (numeric - decimal)
* **PctSameHouse85:** percent of people living in the same house as in 1985 (5 years before) (numeric - decimal)
* **PctSameCity85:** percent of people living in the same city as in 1985 (5 years before) (numeric - decimal)
* **PctSameState85:** percent of people living in the same state as in 1985 (5 years before) (numeric - decimal)
* **LemasSwornFT:** number of sworn full time police officers (numeric - decimal)
* **LemasSwFTPerPop:** sworn full time police officers per 100K population (numeric - decimal)
* **LemasSwFTFieldOps:** number of sworn full time police officers in field operations (on the street as opposed to administrative etc) (numeric - decimal)
* **LemasSwFTFieldPerPop:** sworn full time police officers in field operations (on the street as opposed to administrative etc) per 100K population (numeric - decimal)
* **LemasTotalReq:** total requests for police (numeric - decimal)
* **LemasTotReqPerPop:** total requests for police per 100K population (numeric - decimal)
* **PolicReqPerOffic:** total requests for police per police officer (numeric - decimal)
* **PolicPerPop:** police officers per 100K population (numeric - decimal)
* **RacialMatchCommPol:** a measure of the racial match between the community and the police force. High values indicate proportions in community and police force are similar (numeric - decimal)
* **PctPolicWhite:** percent of police that are caucasian (numeric - decimal)
* **PctPolicBlack:** percent of police that are african american (numeric - decimal)
* **PctPolicHisp:** percent of police that are hispanic (numeric - decimal)
* **PctPolicAsian:** percent of police that are asian (numeric - decimal)
* **PctPolicMinor:** percent of police that are minority of any kind (numeric - decimal)
* **OfficAssgnDrugUnits:** number of officers assigned to special drug units (numeric - decimal)
* **NumKindsDrugsSeiz:** number of different kinds of drugs seized (numeric - decimal)
* **PolicAveOTWorked:** police average overtime worked (numeric - decimal)
* **LandArea:** land area in square miles (numeric - decimal)
* **PopDens:** population density in persons per square mile (numeric - decimal)
* **PctUsePubTrans:** percent of people using public transit for commuting (numeric - decimal)
* **PolicCars:** number of police cars (numeric - decimal)
* **PolicOperBudg:** police operating budget (numeric - decimal)
* **LemasPctPolicOnPatr:** percent of sworn full time police officers on patrol (numeric - decimal)
* **LemasGangUnitDeploy:** gang unit deployed (numeric - decimal - but really ordinal - 0 means NO, 1 means YES, 0.5 means Part Time)
* **LemasPctOfficDrugUn:** percent of officers assigned to drug units (numeric - decimal)
* **PolicBudgPerPop:** police operating budget per population (numeric - decimal)
* **ViolentCrimesPerPop:** total number of violent crimes per 100K population (numeric - decimal) GOAL attribute (to be predicted)

### GOAL attribute (to be predicted)

* **ViolentCrimesPerPop**: total number of violent crimes per 100K population (numeric - decimal)

### Data url and column header

In [0]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data'
header = ['state','county','community','communityname string','fold','population','householdsize','racepctblack','racePctWhite','racePctAsian','racePctHisp','agePct12t21','agePct12t29','agePct16t24','agePct65up','numbUrban','pctUrban','medIncome','pctWWage','pctWFarmSelf','pctWInvInc','pctWSocSec','pctWPubAsst','pctWRetire','medFamInc','perCapInc','whitePerCap','blackPerCap','indianPerCap','AsianPerCap','OtherPerCap','HispPerCap','NumUnderPov','PctPopUnderPov','PctLess9thGrade','PctNotHSGrad','PctBSorMore','PctUnemployed','PctEmploy','PctEmplManu','PctEmplProfServ','PctOccupManu','PctOccupMgmtProf','MalePctDivorce','MalePctNevMarr','FemalePctDiv','TotalPctDiv','PersPerFam','PctFam2Par','PctKids2Par','PctYoungKids2Par','PctTeen2Par','PctWorkMomYoungKids','PctWorkMom','NumIlleg','PctIlleg','NumImmig','PctImmigRecent','PctImmigRec5','PctImmigRec8','PctImmigRec10','PctRecentImmig','PctRecImmig5','PctRecImmig8','PctRecImmig10','PctSpeakEnglOnly','PctNotSpeakEnglWell','PctLargHouseFam','PctLargHouseOccup','PersPerOccupHous','PersPerOwnOccHous','PersPerRentOccHous','PctPersOwnOccup','PctPersDenseHous','PctHousLess3BR','MedNumBR','HousVacant','PctHousOccup','PctHousOwnOcc','PctVacantBoarded','PctVacMore6Mos','MedYrHousBuilt','PctHousNoPhone','PctWOFullPlumb','OwnOccLowQuart','OwnOccMedVal','OwnOccHiQuart','RentLowQ','RentMedian','RentHighQ','MedRent','MedRentPctHousInc','MedOwnCostPctInc','MedOwnCostPctIncNoMtg','NumInShelters','NumStreet','PctForeignBorn','PctBornSameState','PctSameHouse85','PctSameCity85','PctSameState85','LemasSwornFT','LemasSwFTPerPop','LemasSwFTFieldOps','LemasSwFTFieldPerPop','LemasTotalReq','LemasTotReqPerPop','PolicReqPerOffic','PolicPerPop','RacialMatchCommPol','PctPolicWhite','PctPolicBlack','PctPolicHisp','PctPolicAsian','PctPolicMinor','OfficAssgnDrugUnits','NumKindsDrugsSeiz','PolicAveOTWorked','LandArea','PopDens','PctUsePubTrans','PolicCars','PolicOperBudg','LemasPctPolicOnPatr','LemasGangUnitDeploy','LemasPctOfficDrugUn','PolicBudgPerPop','ViolentCrimesPerPop']

*** 
# Exercise #1 - Load data
*** 

##### 1.1 Load data and display. Read in `'?'` as `NaN`.

In [0]:
# load the dataset into a dataframe
print('your code here')
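
A minimal sketch of reading `'?'` as `NaN`, assuming a header-less, comma-separated file like the one at `url`. The three rows and the shortened column list below are made up for illustration so the cell runs without a network connection; they are not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical three-row sample (made-up values), same layout as a header-less CSV.
sample = io.StringIO(
    "1,?,?,Alphaville,1,0.19,0.20\n"
    "2,?,?,Betatown,1,0.00,0.67\n"
    "3,?,?,Gammacity,1,0.04,0.43\n"
)
# Shortened column list, for illustration only.
cols = ['state', 'county', 'community', 'communityname string',
        'fold', 'population', 'ViolentCrimesPerPop']

# header=None: the file has no header row; names supplies the column labels;
# na_values='?' tells pandas to treat the '?' marker as a missing value (NaN).
df_sample = pd.read_csv(sample, header=None, names=cols, na_values='?')
print(df_sample)
```

With the real data, the same call would presumably take `url` and `names=header` in place of `sample` and `cols`.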

##### 1.2 Save the `'communityname string'` column as `df_community`.

In [0]:
# save 'communityname string' as df_community
print('your code here')

> We'll use the `'communityname string'` later to look at results, but we don't need it for learning.

##### 1.3 Drop `'communityname string'` and the other four non-descriptive columns

In [0]:
# drop 5 non-descriptive columns
print('your code here')

##### 1.4 Count the total number of missing values

In [0]:
# count the number of missing values 
print('your code here')

##### 1.5 Drop columns that contain `NaN`

In [0]:
# drop columns with NaN values
print('your code here')

##### 1.6 Confirm we have no missing values

In [0]:
# run to confirm there are no missing values
print('your code here')
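
The count, drop, and confirm steps in 1.4 to 1.6 can be sketched on a toy frame. The frame `df_toy` and its values are hypothetical; only the pandas calls carry over.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'b' has two missing values, 'c' has one.
df_toy = pd.DataFrame({
    'a': [0.1, 0.2, 0.3],
    'b': [np.nan, 0.5, np.nan],
    'c': [0.7, np.nan, 0.9],
})

# isna() marks missing cells; the first sum() counts per column,
# the second sum() adds those counts into one total.
n_missing = int(df_toy.isna().sum().sum())
print('missing values:', n_missing)

# axis=1 drops every column that contains at least one NaN.
df_clean = df_toy.dropna(axis=1)

# The total should now be zero.
print('missing after drop:', int(df_clean.isna().sum().sum()))
```

Dropping columns rather than rows keeps every community in the data, at the cost of losing any attribute that has a missing value.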

*** 
# Exercise #2 - Train a `LinearRegression` model
*** 

In Exercises #2 and #3, we explore a linear `regression` model to predict a continuous value.

##### 2.1 Split data to X and y

In [0]:
# dependent variable y is 'ViolentCrimesPerPop'
print('your code here')

##### 2.2 Construct a `LinearRegression` model

In [0]:
# create a LinearRegression() model
print('your code here')

Self Check

```
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
```

##### 2.3 Run `k=10`-fold cross-validation

* Keep track of the `scores` for each fold and afterwards calculate the `mean` and `std`.

* Keep track of the predictions in list `y_pred`

In [0]:
# k-fold CV
print('your code here')
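
One possible shape for the cross-validation loop, sketched on synthetic data. The arrays `X_demo` and `y_demo` are made up, and the loop is only one of several ways to do this (`cross_val_score` and `cross_val_predict` are alternatives).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption): 200 rows, 5 features,
# a linear target plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 5))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1, 0.0, 0.3]) + rng.normal(0, 0.05, 200)

model = LinearRegression()
kf = KFold(n_splits=10)  # no shuffling: test folds walk through the rows in order

scores_demo = []
y_pred_demo = []
for train_idx, test_idx in kf.split(X_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # score() returns R^2 on the held-out fold
    scores_demo.append(model.score(X_demo[test_idx], y_demo[test_idx]))
    # predictions are appended fold by fold, so they stay in row order
    y_pred_demo.extend(model.predict(X_demo[test_idx]))

print('R^2: %.3f +/- %.3f' % (np.mean(scores_demo), np.std(scores_demo)))
```

With a DataFrame instead of NumPy arrays, the index arrays from `kf.split` would go through `.iloc`. Because `KFold` is not shuffled here, the collected predictions line up row for row with the original target, which is what the length check in 2.5 relies on.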

##### 2.4 Save the `'ViolentCrimesPerPop'` column from `df` as `y_true`.

In [0]:
# save 'ViolentCrimesPerPop' as y_true
print('your code here')

##### 2.5 Confirm that `y_pred` and `y_true` are the same length

Use `==` to answer `True` or `False`.

In [0]:
# Confirm 'y_pred' and 'y_true' are same length (==)
print('your code here')

> Don't continue to the next section until the above is `True`.

*** 
# Exercise #3 - Explore `LinearRegression` results
*** 

##### 3.1 Create a dataframe `df_results`

* `'Community'` $\rightarrow$ `df_community`  (from earlier)
* `'Actual'` $\rightarrow$ `y_true`
* `'Predicted'` $\rightarrow$ `y_pred`

In [0]:
# create df_results consisting of community name, actual, and predicted value
print('your code here')

##### 3.2 Verify shape of `df_results`

In [0]:
# shape of df_results
print('your code here')

Self Check
> expecting `(1994, 3)`

##### 3.3 Describe `df_results`

In [0]:
# describe df_results
print('your code here')

##### 3.4 Scatterplot True vs. Predicted

Scatter plot True vs. Predicted
* add line `m, b = np.polyfit(x, y, 1)`
* add line `plt.plot(x, m*x + b, color='red')`
* where `x` and `y` are your `'y_true'` and `'y_pred'` series
* make sure to title your plot and axes

In [0]:
# make a scatter plot of 'true' vs 'predicted'
x = y_true
y = y_pred
print('your code here')
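
A small sketch of the scatter plot with a fitted line, using made-up values for `x_demo` and `y_demo` in place of `y_true` and `y_pred`.

```python
import matplotlib
matplotlib.use('Agg')  # only needed when running as a script; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical actual (x) and predicted (y) values.
x_demo = np.array([0.0, 0.1, 0.2, 0.4, 0.6, 0.8])
y_demo = np.array([0.05, 0.12, 0.18, 0.35, 0.50, 0.70])

# A degree-1 polyfit returns the slope first, then the intercept.
m, b = np.polyfit(x_demo, y_demo, 1)

fig, ax = plt.subplots()
ax.scatter(x_demo, y_demo)
ax.plot(x_demo, m * x_demo + b, color='red')  # least-squares trend line
ax.set_title('Actual vs. predicted')
ax.set_xlabel('Actual')
ax.set_ylabel('Predicted')
```

A slope near 1 and an intercept near 0 would mean the predictions track the actual values closely.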

##### 3.5 Make a pandas [bar chart](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.bar.html) of the `df_results.head(20)`

Show the actual vs. predicted values for the first 20 results.

* set the figure size to be wide, e.g., `figsize=(16,10)`
* use `plt.grid` to add grid lines (major tick, `'-'`, `0.5` width, `green`)

In [0]:
# bar graph of the first 20 examples
print('your code here')
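
A sketch of the bar chart on a hypothetical three-row results frame. The name `df_demo` and its values are made up; the grid arguments mirror the bullet list above.

```python
import matplotlib
matplotlib.use('Agg')  # only needed when running as a script; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical results frame with the same three columns as df_results.
df_demo = pd.DataFrame({
    'Community': ['A', 'B', 'C'],
    'Actual': [0.20, 0.67, 0.43],
    'Predicted': [0.15, 0.60, 0.50],
})

# Select the two numeric columns explicitly so each row gets two bars.
ax = df_demo[['Actual', 'Predicted']].head(20).plot.bar(figsize=(16, 10))

# Major grid lines, solid, 0.5 wide, green.
plt.grid(which='major', linestyle='-', linewidth=0.5, color='green')
ax.set_title('Actual vs. predicted (first rows)')
```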

Self Check
> There should be two series (blue=`'Actual'` and orange=`'Predicted'`)
>
> For `index=0`
> * `actual = 0.2`
> * `pred = 0.152479`

*** 
# Exercise #4 - Train a `LogisticRegression` model
*** 

In Exercises #4 and #5, we explore a linear `classification` model to predict class labels.

##### 4.1 Make a copy of `df` named `df_logr`

In [0]:
# make a copy df called df_logr
print('your code here')

> We'll use `df_logr` for the remainder of this notebook

##### 4.2 Discretize `'ViolentCrimesPerPop'` using `pd.cut(..)`

* Use the [pandas cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) function to discretize `'ViolentCrimesPerPop'` into categorical values
* Use three bins (note: unequal width)
* Use bin ranges 
  * $[0, 0.25) \rightarrow $ `'low'`
  * $[0.25, 0.5) \rightarrow $ `'medium'`
  * $[0.5, 1.0] \rightarrow $ `'high'`

Note: you may need to extend the range in your bin description (e.g., $-0.01$ or $1.01$) to include boundary values of $0.0$ and $1.0$.

In [0]:
# discretize 'ViolentCrimesPerPop' and print df_logr
print('your code here')
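
A minimal sketch of the binning on a handful of made-up values. It assumes left-closed bins (`right=False`) and extends the last edge to `1.01` so that `1.0` still lands in `'high'`; extending the first edge to `-0.01` with the default right-closed bins would be the other option the note mentions, though that shifts which bin the boundary values `0.25` and `0.5` fall into.

```python
import pandas as pd

# Hypothetical values, chosen to sit on and around the bin boundaries.
values = pd.Series([0.0, 0.1, 0.25, 0.49, 0.5, 1.0])

# right=False makes each bin closed on the left and open on the right:
# [0, 0.25), [0.25, 0.5), [0.5, 1.01)
binned = pd.cut(
    values,
    bins=[0.0, 0.25, 0.5, 1.01],
    right=False,
    labels=['low', 'medium', 'high'],
)
print(binned.tolist())
# ['low', 'low', 'medium', 'medium', 'high', 'high']
```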

##### 4.3 Split `df_logr` into `X` and `y`

In [0]:
# dependent variable y is 'ViolentCrimesPerPop'
print('your code here')

##### 4.4 Create a `LogisticRegression` classifier
* `random_state=1`
* `solver='lbfgs'`
* `max_iter=1000`

In [0]:
# create a LogisticRegression model
print('your code here')

##### 4.5 Run `k=10`-fold cross-validation

* Keep track of the `scores` for each fold and afterwards calculate the `mean` and `std`.

* Keep track of the predictions in list `y_pred`

In [0]:
# k-fold CV
print('your code here')

Self Check
> The `std` across folds is `+/- 0.023`.

*** 
# Exercise #5 - Explore results
*** 

##### 5.1 Save the `'ViolentCrimesPerPop'` column from `df_logr` as `y_true`.

In [0]:
y_true = df_logr['ViolentCrimesPerPop']
print('your code here')

> Ultimately `y_true` should remain unchanged, but for thoroughness we take it from our current dataframe `df_logr`.

##### 5.2 Calculate the accuracy between `y_true` and `y_pred`.

In [0]:
# print the accuracy score
print('your code here')

##### 5.3 Graph the confusion matrix

In [0]:
# graph a confusion matrix
print('your code here')
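
One way to graph a confusion matrix, sketched with hand-written labels. The two label lists are hypothetical, and `ConfusionMatrixDisplay` is just one plotting option (it assumes a reasonably recent scikit-learn).

```python
import matplotlib
matplotlib.use('Agg')  # only needed when running as a script; omit inside a notebook
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

labels = ['low', 'medium', 'high']

# Hypothetical true and predicted class labels.
y_true_demo = ['low', 'low', 'low', 'medium', 'medium', 'high', 'high', 'high']
y_pred_demo = ['low', 'low', 'medium', 'medium', 'high', 'high', 'high', 'low']

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot(cmap='Blues')
plt.title('Confusion matrix')
```

Passing `labels` explicitly keeps the rows and columns in `low`, `medium`, `high` order instead of alphabetical order.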

##### 5.4 Calculate precision, recall, and F$_1$ score

This is a multi-class classification problem (we have three class labels).
* Use `average='weighted'` as the multi-class strategy

In [0]:
# precision: tp / (tp + fp)
print('your code here')
print('Precision: %.3f' % precision)

# recall: tp / (tp + fn)
print('your code here')
print('Recall: %.3f' % recall)

# f1: 2 tp / (2 tp + fp + fn)
print('your code here')
print('F1 score: %.3f' % f1)
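
A sketch of the four metrics on hand-written labels, assuming `average='weighted'` as above. The label lists are hypothetical and the `_demo` names are chosen so they do not overwrite the notebook's own variables.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical true and predicted class labels.
y_true_demo = ['low', 'low', 'low', 'medium', 'medium', 'high', 'high', 'high']
y_pred_demo = ['low', 'low', 'medium', 'medium', 'high', 'high', 'high', 'low']

accuracy_demo = accuracy_score(y_true_demo, y_pred_demo)

# average='weighted' computes each metric per class, then averages the
# per-class values weighted by how many true examples each class has.
precision_demo = precision_score(y_true_demo, y_pred_demo, average='weighted')
recall_demo = recall_score(y_true_demo, y_pred_demo, average='weighted')
f1_demo = f1_score(y_true_demo, y_pred_demo, average='weighted')

print('Accuracy: %.3f' % accuracy_demo)
print('Precision: %.3f' % precision_demo)
print('Recall: %.3f' % recall_demo)
print('F1 score: %.3f' % f1_demo)
```

Support-weighted recall works out to the same number as accuracy, since both reduce to correct predictions divided by total examples, so a matching pair is expected rather than a sign of a bug.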
