# Ex3 - Raz Bareli

### Q1)

In [41]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [42]:
df = pd.read_csv("ex3.csv")

We'll take only the 'safe' features we want to work with:

In [43]:
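# .copy() gives an independent frame, so the in-place edits below don't trigger SettingWithCopyWarning
df = df[['state', 'date', 'congressional_district', 'gun_type', 'participant_gender', 'n_killed']].copy()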

To work with the data, we first have to fill the null values and do some feature engineering, as we did in the previous exercises.

In [44]:
# fill null values with mode as in ex1:
for column in df:
    df[column] = df[column].fillna(df[column].mode()[0])
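
As a quick illustration of the mode-based filling above, here is a minimal sketch on a made-up frame. The values are hypothetical and are not taken from `ex3.csv`; only the loop is the same as in the cell above.

```python
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "state": ["Ohio", None, "Ohio", "Texas"],
    "n_killed": [0, 1, None, 0],
})

filled = toy.copy()
for column in filled:
    # mode() ignores missing values; [0] takes the most frequent value
    filled[column] = filled[column].fillna(filled[column].mode()[0])

print(filled)
```

Note that this fills every column, including the target `n_killed`, with its most frequent value.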

In [45]:
# modify 'participant_gender' as in ex2
# str.contains is case-sensitive, so 'Female' does not match 'Male'
has_female = df['participant_gender'].str.contains('Female')
has_male = df['participant_gender'].str.contains('Male')
df.loc[has_female & has_male, 'participant_gender'] = "Both"
df.loc[has_female & ~has_male, 'participant_gender'] = "Female"
df.loc[~has_female & has_male, 'participant_gender'] = "Male"
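
To see what the gender step does, here is a small sketch on invented strings. The `index::value||index::value` layout is an assumption about how the raw column looks; the three assignments mirror the ones in the cell above.

```python
import pandas as pd

# Hypothetical raw values; the real column format is assumed, not confirmed
toy = pd.DataFrame({"participant_gender": [
    "0::Male",
    "0::Female||1::Female",
    "0::Male||1::Female",
]})

col = "participant_gender"
# Mixed incidents must be labelled first, before the single-gender rules run
toy.loc[toy[col].str.contains("Female") & toy[col].str.contains("Male"), col] = "Both"
toy.loc[toy[col].str.contains("Female"), col] = "Female"
toy.loc[toy[col].str.contains("Male"), col] = "Male"

print(toy[col].tolist())
```

The order matters: once a row is set to "Both", it no longer contains "Female" or "Male", so the later rules leave it alone.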


In [46]:
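# modify 'gun_type' as in ex2
# take the first alphabetic token, or a digit followed by 'm' (so '9mm' stays '9mm' and '10mm' becomes '0mm')
# expand=False returns a Series instead of a one-column DataFrame
df['Gun'] = df['gun_type'].str.extract(r'([A-Za-z]+|[0-9]m+)', expand=False)
def combine_guns(x):
    if x in ["Handgun", "9mm", "0mm", "Win", "Spl", "Spr"]:
        return 'Handgun'
    if x in ["Other", "Unknown"]:
        return 'Unknown'
    # everything else, including values the regex did not match, counts as a rifle
    return 'Rifle'
df['gun_type'] = df['Gun'].apply(combine_guns)
df = df.drop(columns='Gun')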
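
The regular expression in the cell above is easier to follow on a few examples. This is a sketch on invented strings (the `0::Handgun` layout is an assumption about the raw column); `combine_guns` is copied from the cell above.

```python
import pandas as pd

# Hypothetical raw values; the real column format is assumed, not confirmed
raw = pd.Series([
    "0::Handgun",
    "0::9mm",
    "0::10mm",
    "0::22 LR",
    "0::Unknown||1::Handgun",
])

# First match only: an alphabetic token, or one digit followed by 'm's
tokens = raw.str.extract(r'([A-Za-z]+|[0-9][mm]+)', expand=False)

def combine_guns(x):
    if x in ["Handgun", "9mm", "0mm", "Win", "Spl", "Spr"]:
        return 'Handgun'
    if x in ["Other", "Unknown"]:
        return 'Unknown'
    return 'Rifle'

groups = tokens.apply(combine_guns)
print(pd.DataFrame({"raw": raw, "token": tokens, "group": groups}))
```

Two things stand out: `10mm` is captured as `0mm`, which is why that token appears in the handgun list, and only the first gun of an incident is used, so the last row ends up as "Unknown" even though a handgun is listed second.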

In [47]:
# modify date to year/month as in ex1
def delete_day(x):
    return x[:-3]
df['date'] = df['date'].apply(delete_day)

Now we can get to the model training part:

In [48]:
# create dummy variables
df = pd.get_dummies(df, columns=['state', 'date', 'gun_type', 'participant_gender'])
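
For reference, this is roughly what `pd.get_dummies` does to a categorical column. The frame below is made up for illustration and is not part of the exercise data.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column
toy = pd.DataFrame({
    "gun_type": ["Handgun", "Rifle", "Handgun"],
    "n_killed": [0, 2, 1],
})

# Each category becomes its own indicator column; other columns are kept as-is
encoded = pd.get_dummies(toy, columns=["gun_type"])
print(encoded.columns.tolist())
print(encoded)
```

In the real data, `congressional_district` is not listed in `columns`, so it stays a single numeric feature.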

In [49]:
X = df.drop(columns=['n_killed'])
y = df['n_killed']

In [50]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

I'll choose 2 models:
1. Linear Regression
2. Random Forest

Both are regression models, since we are trying to predict a non-negative integer (the number of people killed).
Technically, we could have used a multiclass classifier here, but I don't think it suits this case,
since the classes are ordered rather than independent labels. That is, 10 killed is much more than 2 killed, and a plain classifier would ignore that.
So that's why I've picked regression models.

For the metric I'll choose the mean squared error (MSE).
The advantage of MSE is that it squares each error, so large errors are penalized much more heavily than small ones.
That is, a large error in our prediction (say, we predicted 100 instead of 1) will increase the MSE far more than a small
prediction error (if we predicted 2 instead of 1).
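
A small worked example of this effect, using the numbers from the paragraph above. The helper functions and the comparison with the mean absolute error (MAE) are my own additions for illustration, not part of the exercise.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of the absolute differences, shown only for comparison."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 1, 1, 1]
small_miss = [2, 1, 1, 1]    # one prediction off by 1
large_miss = [100, 1, 1, 1]  # one prediction off by 99

print("small miss:", mse(y_true, small_miss), mae(y_true, small_miss))
print("large miss:", mse(y_true, large_miss), mae(y_true, large_miss))
```

The error grows by a factor of 99, the MAE grows by the same factor, and the MSE grows by a factor of 99² = 9801.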

In [56]:
# linear regression
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_y_pred = lr.predict(X_test)
lr_mse = mean_squared_error(y_test, lr_y_pred)
print("MSE for Linear Regression = ", lr_mse)

MSE for Linear Regression =  0.23443169471174108


In [57]:
# random forest
rfr = RandomForestRegressor(min_samples_split=20)
rfr.fit(X_train, y_train)
rfr_y_pred = rfr.predict(X_test)
rfr_mse = mean_squared_error(y_test, rfr_y_pred)
print("MSE for Random Forest Regression = ", rfr_mse)

MSE for Random Forest Regression =  0.2556210107386696
