# Ex3 - Raz Bareli

### Q1)

In [128]:
import pandas as pd
import numpy as np
from scipy.stats import randint
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, make_scorer

In [129]:
df = pd.read_csv("ex3.csv")

We'll take only the 'safe' features we want to work with:

In [130]:
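# .copy() so that the column assignments below modify an independent frame, not a view of the original
df = df[['state','date','congressional_district','gun_type','participant_gender', 'n_killed']].copy()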

Before we can work with the data, we have to fill in the null values and do some feature engineering, as in the previous exercises, so that's what we'll do first.

In [131]:
# fill null values with mode as in ex1:
for column in df:
    df[column] = df[column].fillna(df[column].mode()[0])

In [132]:
# modify 'participant_gender' as in ex2
df.loc[df['participant_gender'].str.contains('Female', regex=True) & df['participant_gender'].str.contains('Male', regex=True), ['participant_gender']] = "Both"
df.loc[df['participant_gender'].str.contains('Female', regex=True), ['participant_gender']] = "Female"
df.loc[df['participant_gender'].str.contains('Male', regex=True), ['participant_gender']] = "Male"


In [133]:
# modify 'gun_type' as in ex2
# first run of letters, or a digit followed by one or more 'm' (e.g. '9mm')
df['Gun'] = df.gun_type.str.extract('([A-Za-z]+|[0-9]m+)')
def combine_guns(x):
    if x in ["Handgun","9mm","0mm", "Win","Spl", "Spr"]:
        return 'Handgun'
    if x in ["Other", "Unknown"]:
        return 'Unknown'
    else:
        return 'Rifle'
df["Gun"] = df["Gun"].apply(combine_guns)
df['gun_type'] = df['Gun']
df = df.drop(columns='Gun')

In [134]:
# modify date to year/month as in ex1
def delete_day(x):
    return x[:-3]
df['date'] = df['date'].apply(delete_day)

Now we can get to the model training part:

In [135]:
# create dummy variables
df = pd.get_dummies(df, columns=['state', 'date', 'gun_type', 'participant_gender'])

In [136]:
X = df.drop(columns=['n_killed'])
y = df['n_killed']

In [137]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

I'll choose 2 models:
1. Linear Regression
2. Random Forest

Both are regression models, since we are trying to predict a non-negative integer count (0, 1, 2, ...).
Technically we could have used a multiclass classifier here, but I don't think it fits,
since the classes are ordered: 10 killed is much more than 2 killed, and a classifier would treat them as unrelated labels.
That's why I've picked regression models.

For the metric I'll choose the mean squared error (MSE).
The advantage of MSE is that it penalizes large errors much more heavily than small ones, because each error is squared.
That is, a large error in our prediction (say, predicting 100 instead of 1) increases the MSE far more than a small
one (predicting 2 instead of 1).

One disadvantage of this metric is that it doesn't tell us how often we were wrong. For instance,
if every prediction is wrong but only by one, say 1 instead of 0 and 0 instead of 1, the MSE stays at 1, even though the prediction is obviously very bad.
In this case the accuracy metric would have an advantage, since it would tell us that we got 0% of the predictions right.
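The two points above can be checked on toy numbers. This is only a sketch in plain Python: the helpers `mse` and `accuracy` are written here for illustration (they are not the sklearn functions used in the notebook), and the values are made up.

```python
def mse(y_true, y_pred):
    """Mean squared error of two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true value exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Squaring makes one large miss dominate: the truth is 1 in both cases.
big_miss = mse([1], [100])   # (1 - 100)^2
small_miss = mse([1], [2])   # (1 - 2)^2

# Every prediction is wrong, but each one is off by exactly one.
y_true = [0, 1, 0, 1]
y_flipped = [1, 0, 1, 0]
flipped_mse = mse(y_true, y_flipped)
flipped_acc = accuracy(y_true, y_flipped)

print("large miss:", big_miss, "small miss:", small_miss)
print("all wrong by one -> MSE:", flipped_mse, "accuracy:", flipped_acc)
```

The large miss costs 9801 against 1 for the small one, while the all-wrong predictions give an MSE of only 1 and an accuracy of 0.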

In [146]:
# linear regression
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_y_pred = lr.predict(X_test)
lr_mse =  mean_squared_error(y_test, lr_y_pred)
print("MSE for Linear Regression = ", lr_mse)

MSE for Linear Regression =  0.2303223268249139


In [147]:
# random forest
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)
rfr_y_pred = rfr.predict(X_test)
rfr_mse = mean_squared_error(y_test, rfr_y_pred)
print("MSE for Random Forest Regression = ", rfr_mse)

MSE for Random Forest Regression =  0.28917219718131376


As we can see, the errors are not too large, and the two models are fairly close on the MSE metric, with Linear Regression slightly ahead (about 0.23 vs. 0.29).
However, I think that since most of the n_killed values are either 0 or 1, most of the predictions are in that range as well,
and as I explained before, it's hard to tell based on one metric alone whether the predictions are good, since in the worst case they can all be wrong
and still give a relatively low MSE.
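One possible way to put an MSE value in context, which the notebook itself does not do, is to compare it with a constant predictor that always outputs the mean of the target. The MSE of that baseline equals the variance of the target, so when the target is mostly 0 or 1 even this trivial predictor gets a low MSE. The sketch below uses a made-up toy target, not the real `n_killed` column.

```python
from statistics import mean, pvariance


def mse(y_true, y_pred):
    """Mean squared error of two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up toy target that is mostly 0 or 1 (an assumption, loosely imitating n_killed).
y_toy = [0, 0, 0, 1, 0, 1, 0, 0, 2, 0]

# Baseline: always predict the mean of the target.
baseline_pred = [mean(y_toy)] * len(y_toy)
baseline_mse = mse(y_toy, baseline_pred)

# The baseline MSE is the population variance of the target.
target_variance = pvariance(y_toy)

print("baseline MSE:", baseline_mse)
print("target variance:", target_variance)
```

In the notebook, the same kind of check could be done with sklearn's `DummyRegressor` fitted on the training split. A model is only informative to the extent that its MSE is clearly below that baseline.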

### Q2)

For this question I'll continue with the Random Forest Regressor from Q1, so we can see if we can improve the model by
hyperparameter selection.

I'll use Grid Search and Random Search.

I've chosen 3 hyperparameters and 2 values for each, and cv=3 in cross validation - only because otherwise it would take too
long to run (even with these settings it can take up to 10 minutes on my PC).

Ideally, I would have chosen 4 parameters as the exercise suggests, with more values for each parameter, and cv=5 cross-validation,
which is the standard as far as I understand.
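To see why the search gets expensive so quickly, here is a small sketch that counts the model fits. It enumerates the same 2 x 2 x 2 grid used in the next cell with cv=3, and compares it with a hypothetical larger setup (4 parameters with 3 values each and cv=5, numbers picked only for illustration).

```python
from itertools import product

# Same grid as in the notebook: 3 hyperparameters, 2 values each.
params_grid = {
    'n_estimators': [10, 100],
    'max_depth': [20, 50],
    'min_samples_split': [5, 20],
}
cv = 3

# Every combination of values is one candidate.
candidates = [dict(zip(params_grid, values))
              for values in product(*params_grid.values())]
n_candidates = len(candidates)

# Each candidate is fitted once per cross-validation fold.
n_fits = n_candidates * cv

# Hypothetical larger setup: 4 parameters, 3 values each, cv=5.
ideal_fits = (3 ** 4) * 5

print("candidates:", n_candidates, "fits:", n_fits)
print("fits in the hypothetical larger setup:", ideal_fits)
```

The 8 candidates and 24 fits match the "Fitting 3 folds for each of 8 candidates, totalling 24 fits" line in the search output, while the larger setup would need 405 fits.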

I've chosen MSE as the score for the reasons mentioned earlier, and so that I can compare the results with those from Q1.

In [143]:
# Grid Search
gs_rfr = RandomForestRegressor()
params_grid = {
    'n_estimators': [10, 100],
    'max_depth': [20, 50],
    'min_samples_split': [5, 20],
}
n_combinations = np.array([len(l) for key, l in params_grid.items()]).prod() # to use in random grid search
gs = GridSearchCV(estimator=gs_rfr, param_grid=params_grid, verbose=3, cv=3, scoring=make_scorer(mean_squared_error, greater_is_better = False))
gs.fit(X_train, y_train)

print('Best parameters: ', gs.best_params_)
print('Best Score: ', abs(gs.best_score_))

Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV 1/3] END max_depth=20, min_samples_split=5, n_estimators=10;, score=-0.253 total time=   0.8s
[CV 2/3] END max_depth=20, min_samples_split=5, n_estimators=10;, score=-0.250 total time=   0.7s
[CV 3/3] END max_depth=20, min_samples_split=5, n_estimators=10;, score=-0.270 total time=   0.8s
[CV 1/3] END max_depth=20, min_samples_split=5, n_estimators=100;, score=-0.251 total time=   8.6s
[CV 2/3] END max_depth=20, min_samples_split=5, n_estimators=100;, score=-0.245 total time=   8.7s
[CV 3/3] END max_depth=20, min_samples_split=5, n_estimators=100;, score=-0.264 total time=   8.8s
[CV 1/3] END max_depth=20, min_samples_split=20, n_estimators=10;, score=-0.250 total time=   0.8s
[CV 2/3] END max_depth=20, min_samples_split=20, n_estimators=10;, score=-0.248 total time=   0.8s
[CV 3/3] END max_depth=20, min_samples_split=20, n_estimators=10;, score=-0.269 total time=   0.8s
[CV 1/3] END max_depth=20, min_samples_split=20, n_e

In [144]:
# Random Grid Search:
rgs_rfr = RandomForestRegressor()

ranges = {
    # scipy's randint(low, high) samples integers from low up to high - 1
    'n_estimators': randint(10, 100),
    'max_depth': randint(20, 50),
    'min_samples_split': randint(5, 20)
}
rgs = RandomizedSearchCV(estimator=rgs_rfr, param_distributions=ranges, verbose=3, cv=3,  scoring=make_scorer(mean_squared_error, greater_is_better = False), n_iter=n_combinations)
rgs.fit(X_train, y_train)

print('Best parameters: ', rgs.best_params_)
print('Best Score: ', abs(rgs.best_score_))

Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV 1/3] END max_depth=38, min_samples_split=11, n_estimators=87;, score=-0.259 total time=  12.4s
[CV 2/3] END max_depth=38, min_samples_split=11, n_estimators=87;, score=-0.254 total time=  12.3s
[CV 3/3] END max_depth=38, min_samples_split=11, n_estimators=87;, score=-0.277 total time=  11.2s
[CV 1/3] END max_depth=26, min_samples_split=12, n_estimators=99;, score=-0.251 total time=  10.7s
[CV 2/3] END max_depth=26, min_samples_split=12, n_estimators=99;, score=-0.248 total time=  10.5s
[CV 3/3] END max_depth=26, min_samples_split=12, n_estimators=99;, score=-0.268 total time=  10.6s
[CV 1/3] END max_depth=22, min_samples_split=10, n_estimators=49;, score=-0.251 total time=   4.6s
[CV 2/3] END max_depth=22, min_samples_split=10, n_estimators=49;, score=-0.247 total time=   4.7s
[CV 3/3] END max_depth=22, min_samples_split=10, n_estimators=49;, score=-0.266 total time=   4.7s
[CV 1/3] END max_depth=40, min_samples_split=16, 

Well, we can see that the random search found hyperparameters that gave a better score. Note that the best score reported by the search is the mean cross-validation score over held-out folds of the training set, not a test-set score.
As we discussed in class, that can happen because the random search is not limited to the values we listed in the grid,
and of course we need a bit of luck as well, since the parameter values are sampled at random.
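The difference between the two searches can be sketched as follows. This is not how sklearn implements `RandomizedSearchCV`. It only imitates the sampling step with the standard library's `random.randrange`, which, like `scipy.stats.randint(low, high)`, excludes the upper bound. The helper `sample_candidate` and the seed are assumptions made for the illustration.

```python
import random

# Ranges mirror the notebook's randint(low, high) calls; high is excluded.
ranges = {
    'n_estimators': (10, 100),
    'max_depth': (20, 50),
    'min_samples_split': (5, 20),
}


def sample_candidate(ranges, rng):
    """Draw one value per hyperparameter, uniformly from [low, high)."""
    return {name: rng.randrange(low, high)
            for name, (low, high) in ranges.items()}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_iter = 8              # same budget as the 8 grid candidates

candidates = [sample_candidate(ranges, rng) for _ in range(n_iter)]
for candidate in candidates:
    print(candidate)
```

With the same budget of 8 candidates, the grid only ever tries the two listed values per parameter, whereas the random draws can land anywhere in between, such as `max_depth=26` or `n_estimators=49` in the output above.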

Now, just for fun, let's compare the newly fitted models to the one from question 1:


In [149]:
print("question 1 forest error: ", rfr_mse)
print("grid search forest error: ", mean_squared_error(y_test, gs.predict(X_test)))
print("random grid search forest error: ", mean_squared_error(y_test, rgs.predict(X_test)))

question 1 forest error:  0.28917219718131376
grid search forest error:  0.24052877578673204
random grid search forest error:  0.2416238346794654


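And we can see that hyperparameter optimization led to better results: both tuned forests have a lower test MSE than the default forest from Q1 (about 0.24 vs. 0.29), and the two searches end up nearly identical on the test set.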
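To make the size of the improvement concrete, the sketch below computes the relative reduction in test MSE from the numbers printed above. The helper `relative_reduction` is introduced here only for illustration.

```python
# Test-set MSE values copied from the notebook outputs above.
q1_forest_mse = 0.28917219718131376
grid_search_mse = 0.24052877578673204
random_search_mse = 0.2416238346794654
linear_regression_mse = 0.2303223268249139


def relative_reduction(before, after):
    """Fraction by which the error dropped, relative to the starting error."""
    return (before - after) / before


gs_gain = relative_reduction(q1_forest_mse, grid_search_mse)
rgs_gain = relative_reduction(q1_forest_mse, random_search_mse)

print(f"grid search:   {gs_gain:.1%} lower MSE than the default forest")
print(f"random search: {rgs_gain:.1%} lower MSE than the default forest")
print("tuned forest still above linear regression:",
      grid_search_mse > linear_regression_mse)
```

Both searches cut the forest's test MSE by roughly 16 to 17 percent, and the tuned forests remain slightly above the Linear Regression error from Q1. Since the train/test split and the forests are not seeded, these numbers will vary somewhat between runs.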

### Q3.a)