# Logistic Regression and Support Vector Machine
***

### Table of Contents<a id='TOC'></a>
---
1. <a href='#test-train'>Test/Train Split</a><br>
2. <a href='#models'>Models</a><br>
2.1 <a href='#logistic-regression'>Logistic Regression</a><br>
2.2 <a href='#support-vector-machine'>Support Vector Machine</a><br>
3. <a href='#model-analysis'>Model Analysis</a><br>
3.1 <a href='#logistic-regression-weights'>Logistic Regression Weights</a><br>
3.2 <a href='#support-vectors'>Support Vectors</a><br>

### Libraries & Data <a id='libraries-data'></a>
---

In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

In [2]:
# Import libraries
# hide warnings
import warnings
warnings.filterwarnings('ignore')

# all imported libraries used for analysis
import numpy as np
import pandas as pd 
import os 
import urllib
import copy
import plotly 
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns 
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import statsmodels.api as sm
import random
import us

from geopy.geocoders import Nominatim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.utils import resample
from sklearn.feature_selection import RFE
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from datetime import datetime
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import classification_report
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import confusion_matrix
from pandas.plotting import scatter_matrix

# set color scheme and style for seaborn
sns.set(color_codes=True)
sns.set_style('whitegrid')

***REMOVE THE CODE CELL BELOW ONCE ALL HAVE GENERATED THE DATABASE_REV.CSV***

In [3]:
# Read the database.csv file and store in a dataframe
df=pd.read_csv('../Data/database.csv')
#-------------------------------------------------------------#
# 'Perpetrator Age' contains blank strings; treat them as 0 before casting to int
df['Perpetrator Age'] = df['Perpetrator Age'].replace(to_replace=' ', value='0').astype(int)
#-------------------------------------------------------------#
# Bin victim and perpetrator ages into 10-year groups
# (the final bin, labelled '998', holds every age above 100)
age_bins = np.array([0,10,20,30,40,50,60,70,80,90,100,998])
age_labels = ['0-10', '11-20','21-30','31-40','41-50','51-60','61-70','71-80','81-90', '91-100', '998']
df['Victim_Age_Group'] = pd.cut(df['Victim Age'].astype(int), age_bins, labels=age_labels, include_lowest=True)
df['Perpetrator_Age_Group'] = pd.cut(df['Perpetrator Age'], age_bins, labels=age_labels, include_lowest=True)

#-------------------------------------------------------------#
# combine Perpetrator Race & Ethnicity into a new feature - Perpetrator_Race_Ethnicity
# (the equivalent Victim_Race_Ethnicity feature is currently commented out)
#df['Victim_Race_Ethnicity'] = df['Victim Race'] + ', ' + df['Victim Ethnicity']
#df['Victim_Race_Ethnicity'] = df['Victim_Race_Ethnicity'].str.replace(', Unknown','')
df['Perpetrator_Race_Ethnicity'] = df['Perpetrator Race'] + ', ' + df['Perpetrator Ethnicity']
df['Perpetrator_Race_Ethnicity'] = df['Perpetrator_Race_Ethnicity'].str.replace(', Unknown', '')

#-------------------------------------------------------------#
relationship_dict = {
    'Female Partner': ['Wife', 'Girlfriend', 'Ex-Wife', 'Common-Law Wife'],
    'Male Partner': ['Ex-Husband', 'Husband','Boyfriend', 'Common-Law Husband'],
    'Parent': ['Father','In-Law','Mother','Stepfather','Stepmother'],
    'Children': ['Daughter', 'Son', 'Stepdaughter','Stepson'],
    'Sibling': ['Brother', 'Sister'],
    'Work': ['Employee', 'Employer']
}
# Invert the mapping (specific relationship -> group); relationships not listed keep their original label
relationship_lookup = {rel: group for group, rels in relationship_dict.items() for rel in rels}
df['Relationship_Group'] = df['Relationship'].map(relationship_lookup).fillna(df['Relationship'])
#-------------------------------------------------------------#
# data wrangling, clean-up, rename headers, drop columns, change data types, and transforms
# change crime solved values - Yes = 1 and No = 0
df['Crime Solved'] = df['Crime Solved'].replace({'No': 0, 'Yes': 1})
#-------------------------------------------------------------#
df = df.drop(['Victim Count', 'Record Source'], axis=1)
df.to_csv('../Data/database_rev.csv', index=False, header=True)

In [4]:
# Load in data
df = pd.read_csv('../Data/database_rev.csv')

In [5]:
df.head(3)

| | Record ID | Agency Code | Agency Name | Agency Type | City | State | Year | Month | Incident | Crime Type | ... | Perpetrator Age | Perpetrator Race | Perpetrator Ethnicity | Relationship | Weapon | Perpetrator Count | Victim_Age_Group | Perpetrator_Age_Group | Perpetrator_Race_Ethnicity | Relationship_Group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | January | 1 | Murder or Manslaughter | ... | 15 | Native American/Alaska Native | Unknown | Acquaintance | Blunt Object | 0 | 11-20 | 11-20 | Native American/Alaska Native | Acquaintance |
| 1 | 2 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | March | 1 | Murder or Manslaughter | ... | 42 | White | Unknown | Acquaintance | Strangulation | 0 | 41-50 | 41-50 | White | Acquaintance |
| 2 | 3 | AK00101 | Anchorage | Municipal Police | Anchorage | Alaska | 1980 | March | 2 | Murder or Manslaughter | ... | 0 | Unknown | Unknown | Unknown | Unknown | 0 | 21-30 | 0-10 | Unknown | Unknown |


## 1. Test/Train Split <a id='test-train'></a>
---
***Walter***

In [6]:
# Train/Test Split 0.8/0.2
# set seed (random.seed alone does not seed scikit-learn, so random_state is also passed below)
random.seed(1234)
#-------------------------------------------------------------#
# split into train/test
y = df['Crime Solved']
x = df.drop(['Crime Solved'], axis = 1)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=1234)

In [7]:
# Downsampling
#-------------------------------------------------------------#
# set seed (random.seed alone does not seed scikit-learn, so random_state is also passed below)
random.seed(1234)
#-------------------------------------------------------------#
# split off the training pool
# NOTE: test_size=0.8 holds out 80% of the rows, so only 20% of the rows form the training pool
y = df['Crime Solved']
x = df.drop(['Crime Solved'], axis = 1)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.8,random_state=1234)
#-------------------------------------------------------------#
training_df = x_train.copy()  # copy so the new column is not written to a slice of x
training_df['Crime Solved'] = y_train
print(f'Unbalanced data: {training_df.groupby("Crime Solved").count()["Record ID"]}')
#-------------------------------------------------------------#
# Reference sizes: 40% and 25% of the full dataset
fourty = round((len(df)*0.8)/2)
quarter = int(round(len(df)/4,0))

# Separate majority (solved) and minority (unsolved) classes
df_majority = training_df[training_df['Crime Solved']==1]
df_minority = training_df[training_df['Crime Solved']==0]

# Downsample majority class
df_majority_downsampled = resample(df_majority,
                                 replace=False,    # sample without replacement
                                 n_samples=len(df_minority),     # to match minority class
                                 random_state=123) # reproducible results

# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
print(f'\n\nBalanced data: {df_downsampled.groupby("Crime Solved").count()["Record ID"]}')

Unbalanced data: Crime Solved
0    37996
1    89694
Name: Record ID, dtype: int64


Balanced data: Crime Solved
0    37996
1    37996
Name: Record ID, dtype: int64


In [8]:
# Final train/test split: the balanced, downsampled rows form the training set,
# and every row of df that is not among them forms the test set
df_training = df_downsampled.reset_index()
index_train = df_training['index']
df_training = df_training.drop('index', axis=1)
#-------------------------------------------------------------#
# Boolean mask over df that is True for rows NOT used in training
full_ind = df.index.values
train_ind = index_train.values
mask = np.isin(full_ind, train_ind, invert=True)
test_ind = full_ind[mask]
print(f'Actual no. of records: {len(test_ind)}, expected no. of records: {len(df) - len(train_ind)}')
# Sanity check: the test rows must be exactly the rows left over after removing the training rows
if len(test_ind) == (len(df) - len(train_ind)):
    df_test = df.loc[mask]
    train_amt = quarter*2
    test_amt = quarter
    print('Validation Complete')

Actual no. of records: 562462, expected no. of records: 562462
Validation Complete


## 2. Models <a id='models'></a>
---
Create a logistic regression model and a support vector machine model for the
classification task involved with your dataset. Assess how well each model performs (use
80/20 training/testing split for your data). Adjust parameters of the models to make them more
accurate. If your dataset size requires the use of stochastic gradient descent, then using only a
linear kernel is fine.
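
As a rough illustration of the task described above, the sketch below tunes a logistic regression and an SGD-trained linear SVM on an 80/20 split. It is a minimal sketch under stated assumptions, not the team's final model: the data comes from `make_classification` as a synthetic stand-in for the encoded homicide features, and the parameter grids (`C`, `alpha`) and all variable names are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features (assumption: NOT the homicide data)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     n_informative=5, random_state=1234)

# 80/20 training/testing split, stratified so both sets keep the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=1234)

# Both models are linear; SGDClassifier with hinge loss is a linear SVM trained by SGD
searches = {
    'logistic_regression': GridSearchCV(
        Pipeline([('scale', StandardScaler()),
                  ('clf', LogisticRegression(max_iter=1000))]),
        param_grid={'clf__C': [0.01, 0.1, 1.0, 10.0]}, cv=3),
    'linear_svm_sgd': GridSearchCV(
        Pipeline([('scale', StandardScaler()),
                  ('clf', SGDClassifier(loss='hinge', random_state=1234))]),
        param_grid={'clf__alpha': [1e-5, 1e-4, 1e-3]}, cv=3),
}

tuned_accuracy = {}
for name, search in searches.items():
    search.fit(X_tr, y_tr)                          # cross-validated parameter search
    tuned_accuracy[name] = search.score(X_te, y_te)  # accuracy of the best model on the held-out 20%
    print(name, search.best_params_, f'test accuracy = {tuned_accuracy[name]:.3f}')
```

Scaling sits inside the pipeline so that the scaler is fitted only on the training folds during cross-validation, which keeps test information out of the parameter search.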

***Assumptions***<br>
-fill out all assumptions made-

### 2.1 Logistic Regression <a id='logistic-regression'></a>
***
***Thad & Jamie***

### 2.2 Support Vector Machine <a id='support-vector-machine'></a>
---
***Kris & Walter***

## 3. Model Analysis <a id='model-analysis'></a>
---
***TEAM***<br>
Discuss the advantages of each model for each classification task. Does one type
of model offer superior performance over another in terms of prediction accuracy? In terms of
training time or efficiency? Explain in detail. 
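
One way to ground the accuracy and training-time comparison is to time each `fit` call and score both models on the same held-out data. The following is a hedged sketch on synthetic data; the helper `fit_and_score` and the variable names are hypothetical, and absolute timings will depend on the machine and on the real dataset size.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_and_score(model, X_train, y_train, X_test, y_test):
    """Return wall-clock fit time in seconds and test accuracy for one model."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    return {'fit_seconds': fit_seconds, 'accuracy': model.score(X_test, y_test)}


# Synthetic stand-in data (assumption: NOT the homicide data)
X_syn, y_syn = make_classification(n_samples=5000, n_features=20, random_state=1234)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=1234)

candidates = {
    'logistic_regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'linear_svm_sgd': make_pipeline(StandardScaler(), SGDClassifier(loss='hinge', random_state=1234)),
}

comparison = {name: fit_and_score(model, X_tr, y_tr, X_te, y_te)
              for name, model in candidates.items()}
for name, result in comparison.items():
    print(f"{name}: fit {result['fit_seconds']:.3f}s, accuracy {result['accuracy']:.3f}")
```

A single timing is noisy, so repeating each fit several times and reporting the median would give a fairer efficiency comparison.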

### 3.1 Logistic Regression Weights <a id='logistic-regression-weights'></a>
---
***Thad & Jamie***<br>
Use the weights from logistic regression to interpret the importance of different
features for each classification task. Explain your interpretation in detail. Why do you think
some variables are more important?
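
Weights are only comparable across features when the features share a scale, so a common approach is to standardize first and then rank coefficients by absolute value. The sketch below illustrates this on made-up data: the three feature names and the way the label is generated are assumptions chosen so that the expected ranking is known in advance.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1234)

# Hypothetical features: one strong driver, one weak driver, one pure noise column
demo_X = pd.DataFrame({
    'feature_strong': rng.normal(size=1000),
    'feature_weak': rng.normal(size=1000),
    'feature_noise': rng.normal(size=1000),
})
# The label follows a logistic model that ignores feature_noise entirely
latent = 3.0 * demo_X['feature_strong'] + 0.5 * demo_X['feature_weak']
demo_y = (latent + rng.logistic(size=1000) > 0).astype(int)

# Standardize so that coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(demo_X)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, demo_y)

# One weight per feature; the sign gives the direction, the magnitude the strength
weights = pd.Series(clf.coef_[0], index=demo_X.columns)
ranked_weights = weights.reindex(weights.abs().sort_values(ascending=False).index)
print(ranked_weights)
```

With one-hot encoded categorical features, each weight is read relative to the dropped or baseline category, which is worth stating alongside any ranking.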

### 3.2 Support Vectors <a id='support-vectors'></a>
---
***Kris & Walter***<br>
Look at the chosen support vectors for the classification task. Do these provide
any insight into the data? Explain.
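
Note that `SGDClassifier` does not store support vectors, so inspecting them requires fitting `SVC` (for example on a subsample). The sketch below shows which attributes expose them, using two synthetic blobs from `make_blobs` as an assumed stand-in for the real data; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping synthetic clusters (assumption: NOT the homicide data)
X_blobs, y_blobs = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=1234)

svm_demo = SVC(kernel='linear', C=1.0).fit(X_blobs, y_blobs)

# support_ holds row indices into the training data; n_support_ counts them per class
support_rows = svm_demo.support_
per_class = dict(zip(svm_demo.classes_.tolist(), svm_demo.n_support_.tolist()))
support_fraction = len(support_rows) / len(X_blobs)

# Signed distance-like score for each support vector; values near 0 lie close to the boundary
support_scores = svm_demo.decision_function(svm_demo.support_vectors_)

print('support vectors per class:', per_class)
print(f'fraction of training rows that are support vectors: {support_fraction:.2f}')
print('closest support vector score:', np.abs(support_scores).min())
```

A large fraction of support vectors suggests heavily overlapping classes, and looking up `support_rows` in the original dataframe shows which records sit on or inside the margin.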