### Dataset Overview
This dataset contains FBI counts of offenses known to law enforcement in Virginia cities for 2013. Specifically, it includes the variables population, violent crime, murder, rape, robbery, aggravated assault, property crime, burglary, larceny-theft, motor vehicle theft, and arson.

A link to the dataset can be found here: https://ucr.fbi.gov/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/table-8/table-8-state-cuts/table_8_offenses_known_to_law_enforcement_virginia_by_city_2013.xls

After cleaning, the dataset contains 149 observations across 12 columns.

In [148]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import ensemble
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
%matplotlib inline

### Read data in, take a look, and clean it

In [149]:
va_crime = pd.read_csv('va_crime_2013.csv', skiprows=4)

In [150]:
va_crime.head(10)

Unnamed: 0,City,Population,Violent_Crime,Murder,Rape,Robbery,Aggravated_Assault,Property_Crime,Burglary,Larceny_Theft,Motor_Vehicle_Theft,Arson,Unnamed: 12
0,Abingdon,8186,10,0.0,3.0,1.0,6.0,233,20,198,15.0,4.0,
1,Alexandria,148519,258,5.0,21.0,118.0,114.0,2967,249,2427,291.0,13.0,
2,Altavista,3486,8,0.0,0.0,2.0,6.0,56,4,52,0.0,0.0,
3,Amherst,2223,2,0.0,2.0,0.0,0.0,27,6,19,2.0,0.0,
4,Appalachia,1728,12,0.0,2.0,2.0,8.0,77,25,51,1.0,0.0,
5,Ashland,7310,26,0.0,1.0,8.0,17.0,246,14,221,11.0,1.0,
6,Bedford,5894,12,0.0,4.0,3.0,5.0,237,26,199,12.0,0.0,
7,Berryville,4290,5,0.0,2.0,1.0,2.0,80,7,72,1.0,0.0,
8,Big Stone Gap,5568,17,0.0,5.0,0.0,12.0,203,21,176,6.0,2.0,
9,Blacksburg,42603,31,0.0,7.0,4.0,20.0,523,91,417,15.0,8.0,


In [151]:
#Drop empty column
va_crime.drop(va_crime.columns[-1], axis=1, inplace=True)

In [152]:
#Determine missing values across dataframe
missing_values_count = va_crime.isnull().sum()
print(missing_values_count)

City                   0
Population             2
Violent_Crime          2
Murder                 2
Rape                   2
Robbery                2
Aggravated_Assault     2
Property_Crime         2
Burglary               2
Larceny_Theft          2
Motor_Vehicle_Theft    2
Arson                  2
dtype: int64


In [153]:
#Drop missing values
va_crime = va_crime.dropna()

In [154]:
#Describe the data
va_crime.describe()

Unnamed: 0,Murder,Rape,Robbery,Aggravated_Assault,Motor_Vehicle_Theft,Arson
count,149.0,149.0,149.0,149.0,149.0,149.0
mean,1.295302,6.919463,20.234899,36.194631,32.496644,3.255034
std,4.78102,18.423882,72.760008,111.625561,115.290973,11.478625
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,1.0,0.0,4.0,2.0,0.0
75%,0.0,4.0,6.0,13.0,11.0,1.0
max,37.0,140.0,624.0,842.0,938.0,99.0


In [155]:
va_crime.dtypes

City                    object
Population              object
Violent_Crime           object
Murder                 float64
Rape                   float64
Robbery                float64
Aggravated_Assault     float64
Property_Crime          object
Burglary                object
Larceny_Theft           object
Motor_Vehicle_Theft    float64
Arson                  float64
dtype: object

In [156]:
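#Define function to strip thousands separators and convert to int
def remove_comma_convert_int(x):
    # A bare str(x) discards its result, so assign it back before using string methods
    x = str(x)
    # replace() leaves the string unchanged when there is no comma, so no branch is needed
    return int(x.replace(',', ''))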

In [157]:
#Apply remove_comma_convert_int function to all appropriate columns

va_crime['Population'] = va_crime['Population'].apply(remove_comma_convert_int)
va_crime['Violent_Crime'] = va_crime['Violent_Crime'].apply(remove_comma_convert_int)
va_crime['Property_Crime'] = va_crime['Property_Crime'].apply(remove_comma_convert_int)
va_crime['Burglary'] = va_crime['Burglary'].apply(remove_comma_convert_int)
va_crime['Larceny_Theft'] = va_crime['Larceny_Theft'].apply(remove_comma_convert_int)
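
The same cleanup can be done without a helper function by using pandas string methods on each column. The following is a minimal sketch on a made-up three-row frame; the values and the `count_cols` name are illustrative assumptions, not the real file.

```python
import pandas as pd

# Hypothetical miniature of the comma-formatted count columns.
sample = pd.DataFrame({
    "Population": ["8,186", "148,519", "3486"],
    "Burglary": ["20", "249", "4"],
})

count_cols = ["Population", "Burglary"]
for col in count_cols:
    # Remove thousands separators, then cast the cleaned strings to integers.
    sample[col] = (
        sample[col].astype(str).str.replace(",", "", regex=False).astype(int)
    )

print(sample.dtypes)
```

Passing `thousands=','` to `pd.read_csv` is another option, since it parses such values as numbers at load time.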

In [158]:
#Convert float columns to int

va_crime['Murder'] = va_crime['Murder'].astype(int)
va_crime['Rape'] = va_crime['Rape'].astype(int)
va_crime['Robbery'] = va_crime['Robbery'].astype(int)
va_crime['Aggravated_Assault'] = va_crime['Aggravated_Assault'].astype(int)
va_crime['Motor_Vehicle_Theft'] = va_crime['Motor_Vehicle_Theft'].astype(int)
va_crime['Arson'] = va_crime['Arson'].astype(int)

### Engineer additional features

In [159]:
#Population_Squared
va_crime['Population_Squared'] = va_crime['Population']**2

In [160]:
#General Theft - Multiply Robbery, Larceny_Theft, Motor_Vehicle_Theft
va_crime['General_Theft'] = va_crime['Robbery'] * va_crime['Larceny_Theft'] * va_crime['Motor_Vehicle_Theft']

In [161]:
#Log of Population
va_crime['Population_Log'] = np.log(va_crime['Population'])

In [162]:
#Establish outcome variable (convert to binary)

va_crime['Rape'] = np.where((va_crime['Rape'] > 0), 1, 0)

In [163]:
#Let's take a look at our new and improved dataframe

va_crime.head(5)

Unnamed: 0,City,Population,Violent_Crime,Murder,Rape,Robbery,Aggravated_Assault,Property_Crime,Burglary,Larceny_Theft,Motor_Vehicle_Theft,Arson,Population_Squared,General_Theft,Population_Log
0,Abingdon,8186,10,0,1,1,6,233,20,198,15,4,67010596,2970,9.010181
1,Alexandria,148519,258,5,1,118,114,2967,249,2427,291,13,22057893361,83338326,11.908468
2,Altavista,3486,8,0,0,2,6,56,4,52,0,0,12152196,0,8.15651
3,Amherst,2223,2,0,1,0,0,27,6,19,2,0,4941729,0,7.706613
4,Appalachia,1728,12,0,1,2,8,77,25,51,1,0,2985984,102,7.45472


In [164]:
rape_total = va_crime['Rape'].sum()
print('Baseline accuracy for Rape is: ' + str(round((rape_total/va_crime.shape[0])*100, 2)) + '%')

Baseline accuracy for Rape is: 58.39%
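
The calculation above equals the majority-class baseline only because the positive class happens to be the larger one. A small sketch of a baseline that works in either case is shown below. The label vector is made up to match the 58.39% positive share of 149 rows (87 positives, 62 negatives).

```python
import pandas as pd

# Hypothetical binary outcome with 87 positives and 62 negatives,
# matching the class split reported in this notebook.
y = pd.Series([1] * 87 + [0] * 62)

# Share of the most frequent class = accuracy of always predicting it.
baseline = y.value_counts(normalize=True).max()
print(f"Majority-class baseline: {baseline * 100:.2f}%")
```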


### Let's start building our models - Goal is to achieve higher accuracy than the baseline of approximately 58%

## Regular Logistic Regression Model
Let's begin by using all features, with the exception of City. 

In [165]:
#Create a copy of every column except City to use as features
#(copying avoids a SettingWithCopyWarning when Rape is dropped in place below)
va_crime_features = va_crime.iloc[:, 1:].copy()

In [166]:
#Drop rape from features dataframe
va_crime_features.drop('Rape', axis=1, inplace=True)

In [167]:
# Declare a logistic regression classifier
lr = LogisticRegression()
Y = va_crime['Rape']
X = va_crime_features

# Fit the model.
fit = lr.fit(X, Y)

# Display.
print('Coefficients')
print(fit.coef_)
print(fit.intercept_)
pred_y_sklearn = lr.predict(X)

print('\n Accuracy')
print(pd.crosstab(pred_y_sklearn, Y))

print('\n Percentage accuracy')
print(str(lr.score(X, Y)*100) + '%')

Coefficients
[[ 1.43931676e-12  1.36713605e-14  5.56832204e-17  2.47772853e-15
   7.06201725e-15  1.05936608e-13  1.18467259e-14  9.05329733e-14
   3.55690849e-15  4.11551641e-16  1.55662430e-08  1.27395495e-11
  -4.49826359e-15]]
[-7.81976176e-16]

 Accuracy
Rape    0   1
row_0        
1      62  87

 Percentage accuracy
58.38926174496645%


Our model has predicted Rape for every city in our dataset. This means that it is performing exactly the same as our baseline on our training data.

In [168]:
#Cross-Validation

display(cross_val_score(lr, va_crime_features, va_crime['Rape'], cv=10))

array([0.5625    , 0.5625    , 0.6       , 0.6       , 0.6       ,
       0.6       , 0.6       , 0.57142857, 0.57142857, 0.57142857])

Cross-validation gives the same picture: the fold scores range from 0.5625 to 0.60 and average about 0.58, which matches the baseline rather than improving on it.
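
The fold scores can be summarized and set against the baseline with a quick calculation. The scores below are copied from the output above, and the baseline is the 87-of-149 positive share.

```python
import numpy as np

# Fold accuracies copied from the cross-validation output above.
fold_scores = np.array([0.5625, 0.5625, 0.6, 0.6, 0.6,
                        0.6, 0.6, 0.57142857, 0.57142857, 0.57142857])
baseline = 87 / 149  # share of positive cities in the full data

print(f"Mean CV accuracy: {fold_scores.mean():.4f}")
print(f"Baseline:         {baseline:.4f}")
```

The two numbers agree to three decimal places, which is what one would expect from a model that predicts the positive class for every city.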

## Lasso Logistic Regression Model

In [177]:
# Declare a logistic regression classifier, using penalty 'l1' to indicate lasso
# (solver='liblinear' is set explicitly because it supports the L1 penalty)
lr_lasso = LogisticRegression(penalty='l1', solver='liblinear')
Y = va_crime['Rape']
X = va_crime_features

# Fit the model.
fit = lr_lasso.fit(X, Y)

# Display.
print('Coefficients')
print(fit.coef_)
print(fit.intercept_)
pred_y_sklearn_lasso = lr_lasso.predict(X)

print('\n Accuracy')
print(pd.crosstab(pred_y_sklearn_lasso, Y))

print('\n Percentage accuracy')
print(str(lr_lasso.score(X, Y)*100) + '%')

Coefficients
[[-5.01013942e-05  2.90894454e+00  0.00000000e+00 -1.85459175e+00
  -2.64309664e+00 -2.08196294e-03  2.22824987e-02 -3.47162246e-03
   3.99909813e-02  0.00000000e+00 -3.14284531e-09 -1.31276180e-07
  -4.21234264e-01]]
[0.]

 Accuracy
Rape    0   1
row_0        
0      61   2
1       1  85

 Percentage accuracy
97.98657718120806%


In [178]:
#Cross-Validation

display(cross_val_score(lr_lasso, va_crime_features, va_crime['Rape'], cv=10))

array([0.875     , 0.9375    , 0.93333333, 0.86666667, 0.6       ,
       0.8       , 0.8       , 1.        , 0.78571429, 0.92857143])

During cross-validation, the fold scores of our lasso logistic regression model fluctuate widely, from 0.60 to 1.00. Later on we will redo our feature set after a recursive feature selection process.
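
With roughly 15 cities per fold, a single misclassification moves a fold's score by several points, so some spread is expected. One way to get a steadier estimate is to repeat the stratified split several times and look at the mean and standard deviation. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the crime data) and sets `solver='liblinear'`, which supports the L1 penalty.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the crime features: 149 rows, 13 columns.
X, y = make_classification(n_samples=149, n_features=13, n_informative=5,
                           random_state=0)

# L1-penalized (lasso) logistic regression.
model = LogisticRegression(penalty="l1", solver="liblinear")

# 10 folds repeated 3 times with different shuffles gives 30 scores.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print(f"mean={scores.mean():.3f}, std={scores.std():.3f}, n={scores.size}")
```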

In [184]:
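#Cross-validation of the lasso model on the revised feature set
#(va_crime_features_revised is defined in the recursive feature selection section below; run that cell first)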

lasso_score = cross_val_score(lr_lasso, va_crime_features_revised, va_crime['Rape'], cv=10)
lasso_score.mean()

0.9127380952380953

## Ridge Logistic Regression Model

In [142]:
# Declare a logistic regression classifier, using penalty 'l2' to indicate ridge
lr_ridge = LogisticRegression(penalty='l2')
Y = va_crime['Rape']
X = va_crime_features

# Fit the model.
fit = lr_ridge.fit(X, Y)

# Display.
print('Coefficients')
print(fit.coef_)
print(fit.intercept_)
pred_y_sklearn_ridge = lr_ridge.predict(X)

print('\n Accuracy')
print(pd.crosstab(pred_y_sklearn_ridge, Y))

print('\n Percentage accuracy')
print(str(lr_ridge.score(X, Y)*100) + '%')

Coefficients
[[ 1.43931676e-12  1.36713605e-14  5.56832204e-17  2.47772853e-15
   7.06201725e-15  1.05936608e-13  1.18467259e-14  9.05329733e-14
   3.55690849e-15  4.11551641e-16  1.55662430e-08  1.27395495e-11
  -4.49826359e-15]]
[-7.81976176e-16]

 Accuracy
Rape    0   1
row_0        
1      62  87

 Percentage accuracy
58.38926174496645%


In [143]:
#Cross-Validation

display(cross_val_score(lr_ridge, va_crime_features, va_crime['Rape'], cv=10))

array([0.5625    , 0.5625    , 0.6       , 0.6       , 0.6       ,
       0.6       , 0.6       , 0.57142857, 0.57142857, 0.57142857])

### Now that we've run several versions of the model, let's try using recursive feature selection on our original logistic regression model and try to improve based off those results

In [145]:
# Pass logistic regression model to the RFE constructor
from sklearn.feature_selection import RFE

selector = RFE(lr)
selector = selector.fit(va_crime_features, va_crime['Rape'])

In [146]:
print(selector.ranking_)

[1 1 8 6 3 1 2 1 5 7 1 1 4]
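
`RFE` keeps half of the features by default when `n_features_to_select` is not set, which is why six of the thirteen features share rank 1 here. The sketch below sets the number explicitly on synthetic data (a stand-in, not the crime data) and standardizes the features first so that coefficient magnitudes are comparable.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 8 features, of which 3 carry signal.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Keep exactly 3 features; the rest are ranked 2, 3, ... in order of removal.
selector = RFE(LogisticRegression(), n_features_to_select=3)
selector.fit(X_scaled, y)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 = kept, larger = eliminated earlier
```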


In [147]:
#Now turn into a dataframe so you can sort by rank

rankings = pd.DataFrame({'Features': va_crime_features.columns, 'Ranking' : selector.ranking_})
rankings.sort_values('Ranking')

Unnamed: 0,Features,Ranking
0,Population,1
1,Violent_Crime,1
5,Property_Crime,1
7,Larceny_Theft,1
10,Population_Squared,1
11,General_Theft,1
6,Burglary,2
4,Aggravated_Assault,3
12,Population_Log,4
8,Motor_Vehicle_Theft,5
3,Robbery,6
9,Arson,7
2,Murder,8


**Next Steps:** Based on these rankings, let's remove the features ranked worse than 3 and run our models again.

In [171]:
#Redo our feature set, removing the five features ranked worse than 3
va_crime_features_revised = va_crime_features.drop(['Murder','Arson', 'Robbery', 'Motor_Vehicle_Theft', 'Population_Log'], axis=1)
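
The drop list can also be derived from the rankings rather than typed by hand, which keeps the code in step with the stated rule. A minimal sketch follows, with the feature names and ranks copied from the RFE output above. The `max_rank` and `to_drop` names are illustrative assumptions.

```python
import pandas as pd

# Feature names in column order and the RFE ranks printed earlier.
features = ["Population", "Violent_Crime", "Murder", "Robbery",
            "Aggravated_Assault", "Property_Crime", "Burglary",
            "Larceny_Theft", "Motor_Vehicle_Theft", "Arson",
            "Population_Squared", "General_Theft", "Population_Log"]
ranks = [1, 1, 8, 6, 3, 1, 2, 1, 5, 7, 1, 1, 4]

rankings = pd.DataFrame({"Features": features, "Ranking": ranks})

max_rank = 3  # keep features ranked 3 or better
to_drop = rankings.loc[rankings["Ranking"] > max_rank, "Features"].tolist()
print(to_drop)
```

In the notebook, a list built this way could be passed to `va_crime_features.drop(columns=to_drop)`.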

**Regular Logistic Regression Model**

In [172]:
lr2 = LogisticRegression()
Y = va_crime['Rape']
X = va_crime_features_revised

# Fit the model.
fit = lr2.fit(X, Y)

# Display.
print('Coefficients')
print(fit.coef_)
print(fit.intercept_)
pred_y_sklearn2 = lr2.predict(X)

print('\n Accuracy')
print(pd.crosstab(pred_y_sklearn2, Y))

print('\n Percentage accuracy')
print(str(lr2.score(X, Y)*100) + '%')

Coefficients
[[1.43931676e-12 1.36713605e-14 7.06201725e-15 1.05936608e-13
  1.18467259e-14 9.05329733e-14 1.55662430e-08 1.27395495e-11]]
[-7.81976176e-16]

 Accuracy
Rape    0   1
row_0        
1      62  87

 Percentage accuracy
58.38926174496645%


**Lasso Logistic Regression Model**

In [174]:
# solver='liblinear' is set explicitly because it supports the L1 penalty
lr_lasso2 = LogisticRegression(penalty='l1', solver='liblinear')
Y = va_crime['Rape']
X = va_crime_features_revised

# Fit the model.
fit = lr_lasso2.fit(X, Y)

# Display.
print('Coefficients')
print(fit.coef_)
print(fit.intercept_)
pred_y_sklearn_lasso2 = lr_lasso2.predict(X)

print('\n Accuracy')
print(pd.crosstab(pred_y_sklearn_lasso2, Y))

print('\n Percentage accuracy')
print(str(lr_lasso2.score(X, Y)*100) + '%')

Coefficients
[[ 3.19667171e-05  2.48820423e-01 -3.84128627e-02  1.96567539e-03
  -9.44310533e-04  2.26372038e-03 -1.49420717e-09  6.69236300e-08]]
[-1.53189483]

 Accuracy
Rape    0   1
row_0        
0      57  18
1       5  69

 Percentage accuracy
84.56375838926175%


In [185]:
#Cross validation for our second lasso model

lasso_score2 = cross_val_score(lr_lasso2, va_crime_features_revised, va_crime['Rape'], cv=10)
lasso_score2.mean()

0.8664285714285714

**Ridge Logistic Regression Model**

In [175]:
# Declare a logistic regression classifier, using penalty 'l2' to indicate ridge
lr_ridge2 = LogisticRegression(penalty='l2')
Y = va_crime['Rape']
X = va_crime_features_revised

# Fit the model.
fit = lr_ridge2.fit(X, Y)

# Display.
print('Coefficients')
print(fit.coef_)
print(fit.intercept_)
pred_y_sklearn_ridge2 = lr_ridge2.predict(X)

print('\n Accuracy')
print(pd.crosstab(pred_y_sklearn_ridge2, Y))

print('\n Percentage accuracy')
print(str(lr_ridge2.score(X, Y)*100) + '%')

Coefficients
[[1.43931676e-12 1.36713605e-14 7.06201725e-15 1.05936608e-13
  1.18467259e-14 9.05329733e-14 1.55662430e-08 1.27395495e-11]]
[-7.81976176e-16]

 Accuracy
Rape    0   1
row_0        
1      62  87

 Percentage accuracy
58.38926174496645%


## Evaluation of All Three Models

On the first pass I used all features in all three models. Their training accuracies were:
    Logistic - 58.39%
    Lasso  - 97.99%
    Ridge  - 58.39%

On the second pass I refit the models on the feature set revised after recursive feature selection. The training accuracies were:
    Logistic - 58.39%
    Lasso  - 84.56%
    Ridge  - 58.39%

By far the most successful model was the original lasso logistic regression model. Its fold scores fluctuated during cross-validation (from 0.60 to 1.00, averaging about 85%), so I am not sure the 97.99% training accuracy is trustworthy. Cross-validating the lasso model on the revised feature set gave mean scores of 91.27% in one run and 86.64% in another, both of which seem more legitimate. I am interested to learn why the logistic and ridge models performed right at the baseline accuracy in both iterations. One thing worth noting is that `LogisticRegression()` applies an L2 penalty by default, so the "regular" and ridge models are the same model, which is why their coefficients and scores are identical.
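
One possible contributor to the baseline-level results is feature scale: `Population_Squared` reaches values in the tens of billions while most counts are small, and the fitted coefficients are on the order of 1e-12. This is a hypothesis, not something the notebook tests. A sketch of how it could be checked is shown below, standardizing the features inside a pipeline so that the scaling is refit on each training fold. The synthetic data and variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# One informative small-scale feature and one uninformative huge-scale feature.
signal = rng.normal(size=n)
huge = rng.normal(size=n) * 1e9
X = np.column_stack([signal, huge])
y = (signal + 0.3 * rng.normal(size=n) > 0).astype(int)

# Standardize, then fit logistic regression (L2 penalty by default).
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(scaled_model, X, y, cv=5)

print(f"Mean CV accuracy with scaling: {scores.mean():.3f}")
```

Running the same pipeline on `va_crime_features` and comparing it with the unscaled fit would show whether scale explains the near-zero coefficients.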