# 4 Pre-Processing and Training Data<a id='4_Pre-Processing_and_Training_Data'></a>

## 4.1 Contents<a id='4.1_Contents'></a>
* [4 Pre-Processing and Training Data](#4_Pre-Processing_and_Training_Data)
  * [4.1 Contents](#4.1_Contents)
  * [4.2 Introduction](#4.2_Introduction)
  * [4.3 Imports](#4.3_Imports)
  * [4.4 Load Data](#4.4_Load_Data)
      * [4.4.1 Checking Null Values](#4.4.1_Checking_Null_Values)
      * [4.4.2 Checking Categorical Variable and their unique values](#4.4.2_Checking_Categorical_Variable_and_their_unique_values)
          * [4.4.2.1 Reducing Conflicting Shapes by changing all to lowercase](#4.4.2.1_Reducing_Conflicting_Shapes_by_changing_all_to_lowercase)
          * [4.4.2.2 Reducing Country labels by categorising all countries with less than 100 sightings to OTHER](#4.4.2.2_Reducing_Country_labels_by_categorising_all_countries_with_less_than_100_sightings_to_OTHER)
          * [4.4.2.3 Extracting Month, Day & hour from Datetime column & dropping it](#4.4.2.3_Extracting_Month,_Day_&_hour_from_Datetime_column_&_dropping_it)
      * [4.4.3 Creating datasets for Modelling](#4.4.3_Creating_datasets_for_Modelling)
          * [4.4.3.1 Dataset 1 All Features](#4.4.3.1_Dataset_1_All_Features)
          * [4.4.3.2 Dataset 2 - without City & State Columns](#4.4.3.2_Dataset_2_-_without_City_&_State_Columns)
  * [4.5 Encoding Categorical Features](#4.5_Encoding_Categorical_Features)
      * [4.5.1 For Dataset 1](#4.5.1_For_Dataset_1)
      * [4.5.2 For Dataset 2](#4.5.2_For_Dataset_2)
  * [4.6 Train/Test Split](#4.6_Train/Test_Split)
      * [4.6.1 For Dataset 1](#4.6.1_For_Dataset_1)
      * [4.6.2 For Dataset 2](#4.6.2_For_Dataset_2)
  * [4.7  Metrics - User Defined Functions / SK Learn](#4.7_Metrics_-_User_Defined_Functions_/_SK_Learn)
      * [4.7.1 R-squared, or coefficient of determination](#4.7.1_R-squared,_or_coefficient_of_determination)
      * [4.7.2 Mean Absolute Error](#4.7.2_Mean_Absolute_Error)
      * [4.7.3 Mean Squared Error](#4.7.3_Mean_Squared_Error)
  * [4.8 Scaling Data](#4.8_Scaling_Data)
  * [4.9 Training Model on Data](#4.9_Training_Model_on_Data)
      * [4.9.1 Linear Regression Model](#4.9.1_Linear_Regression_Model)
      * [4.9.2 StatsModel OLS](#4.9.2_StatsModel_OLS)
      * [4.9.3 Lasso Regression](#4.9.3_Lasso_Regression)
  
  

## 4.2 Introduction<a id='4.2_Introduction'></a>

In the preceding notebooks, we performed preliminary assessments of data quality and refined the question to be answered. We explored the UFO sightings data by cleaning and transforming it, and visualized some relationships between shape, duration, datetime and location.

Our purpose here is to predict sighting duration using various regression models and to assess which of them gives the best prediction.


## 4.3 Imports<a id='4.3_Imports'></a>

In [1]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime

from library.sb_utils import save_file

## 4.4 Load Data<a id='4.4_Load_Data'></a>

In [2]:
data = pd.read_csv('ufo_c.csv', parse_dates = ['Date_time', 'Date_posted'], low_memory = False)

In [3]:
data.head()

| | Date_time | Duration_minutes | Description | Date_posted | lat_long | Country | State | City | Shape_final | Year | Month | sh | lat | long |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1949-10-10 20:30:00 | 45.0 | This event took place in early fall around 194... | 2004-04-27 | (29.8830556, -97.9411111) | US | Texas | San Marcos | ['cylinder'] | 1949 | October | cylinder | 29.883056 | -97.941111 |
| 1 | 1949-10-10 21:00:00 | 60.0 | 1949 Lackland AFB&#44 TX. Lights racing acros... | 2005-12-16 | (29.38421, -98.581082) | US | Texas | Lackland Air Force Base | ['light'] | 1949 | October | light | 29.38421 | -98.581082 |
| 2 | 1955-10-10 17:00:00 | 0.33 | Green/Orange circular disc over Chester&#44 En... | 2008-01-21 | (53.2, -2.916667) | GB | England | Blacon | ['circle'] | 1955 | October | circle | 53.2 | -2.916667 |
| 3 | 1956-10-10 21:00:00 | 30.0 | My older brother and twin sister were leaving ... | 2004-01-17 | (28.9783333, -96.6458333) | US | Texas | Edna | ['circle'] | 1956 | October | circle | 28.978333 | -96.645833 |
| 4 | 1960-10-10 20:00:00 | 15.0 | AS a Marine 1st Lt. flying an FJ4B fighter/att... | 2004-01-22 | (21.4180556, -157.8036111) | US | Hawaii | Kane'ohe | ['light'] | 1960 | October | light | 21.418056 | -157.803611 |


### 4.4.1 Checking Null Values<a id='4.4.1_Checking_Null_Values'></a>

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77544 entries, 0 to 77543
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Date_time         77544 non-null  datetime64[ns]
 1   Duration_minutes  77544 non-null  float64       
 2   Description       77544 non-null  object        
 3   Date_posted       77544 non-null  datetime64[ns]
 4   lat_long          77544 non-null  object        
 5   Country           77544 non-null  object        
 6   State             77544 non-null  object        
 7   City              77544 non-null  object        
 8   Shape_final       77544 non-null  object        
 9   Year              77544 non-null  int64         
 10  Month             77544 non-null  object        
 11  sh                77544 non-null  object        
 12  lat               77544 non-null  float64       
 13  long              77544 non-null  float64       
dtypes: datetime64[ns](2), float64(3), int64(1), object(8)

In [5]:
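# Take an explicit copy so that later column assignments modify data_m itself, not a slice of data
data_m = data[['Date_time', 'Duration_minutes', 'Country', 'State', 'City', 'Year', 'Month', 'sh', 'lat','long']].copy()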

In [6]:
data_m.head()

| | Date_time | Duration_minutes | Country | State | City | Year | Month | sh | lat | long |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1949-10-10 20:30:00 | 45.0 | US | Texas | San Marcos | 1949 | October | cylinder | 29.883056 | -97.941111 |
| 1 | 1949-10-10 21:00:00 | 60.0 | US | Texas | Lackland Air Force Base | 1949 | October | light | 29.38421 | -98.581082 |
| 2 | 1955-10-10 17:00:00 | 0.33 | GB | England | Blacon | 1955 | October | circle | 53.2 | -2.916667 |
| 3 | 1956-10-10 21:00:00 | 30.0 | US | Texas | Edna | 1956 | October | circle | 28.978333 | -96.645833 |
| 4 | 1960-10-10 20:00:00 | 15.0 | US | Hawaii | Kane'ohe | 1960 | October | light | 21.418056 | -157.803611 |


In [7]:
data_m.isnull().sum()

Date_time           0
Duration_minutes    0
Country             0
State               0
City                0
Year                0
Month               0
sh                  0
lat                 0
long                0
dtype: int64

### 4.4.2 Checking Categorical Variable and their unique values<a id='4.4.2_Checking_Categorical_Variable_and_their_unique_values'></a>

In [8]:
data_m['Country'].nunique()

157

In [9]:
data_m['State'].nunique()

796

In [10]:
data_m['sh'].unique()

array(['cylinder', 'light', 'circle', 'sphere', 'Circle', 'Changing',
       'Disk', 'disk', 'Round', 'fireball', 'unknown', 'Light', 'Sphere',
       'other', 'oval', 'cigar', 'rectangle', 'chevron', 'triangle',
       'formation', 'Chevron', 'Triangle', 'delta', 'changing', 'Cigar',
       'Egg', 'flash', 'Oval', 'Cylinder', 'Flash', 'Formation', 'cross',
       'Crescent', 'Fireball', 'Cross', 'Diamond', 'teardrop', 'egg',
       'Rectangle', 'Changed', 'diamond', 'Delta', 'cone', 'Cone',
       'Hexagon', 'Flare', 'Teardrop', 'Pyramid', 'pyramid', 'flare',
       'round'], dtype=object)

#### 4.4.2.1 Reducing Conflicting Shapes by changing all to lowercase<a id='4.4.2.1_Reducing_Conflicting_Shapes_by_changing_all_to_lowercase'></a>

In [11]:
data_m.loc[:,'sh'] = data_m['sh'].str.lower()



In [12]:
data_m['sh'].unique()

array(['cylinder', 'light', 'circle', 'sphere', 'changing', 'disk',
       'round', 'fireball', 'unknown', 'other', 'oval', 'cigar',
       'rectangle', 'chevron', 'triangle', 'formation', 'delta', 'egg',
       'flash', 'cross', 'crescent', 'diamond', 'teardrop', 'changed',
       'cone', 'hexagon', 'flare', 'pyramid'], dtype=object)

In [13]:
data_m['sh'].value_counts()

light        23358
triangle      7267
circle        5460
fireball      5201
disk          4626
sphere        4562
other         4415
unknown       4039
oval          3169
formation     2988
cigar         1901
changing      1780
round         1203
cylinder      1141
rectangle     1133
flash         1066
diamond       1041
chevron        795
egg            632
teardrop       605
changed        427
cross          270
cone           261
crescent        62
flare           59
delta           44
pyramid         28
hexagon         11
Name: sh, dtype: int64

In [14]:
data_m['sh'].nunique()

28

#### 4.4.2.2 Reducing Country labels by categorising all countries with less than 100 sightings to OTHER<a id='4.4.2.2_Reducing_Country_labels_by_categorising_all_countries_with_less_than_100_sightings_to_OTHER'></a>

In [15]:
counts = data_m['Country'].value_counts()

In [16]:
mask_1 = data_m['Country'].isin(counts[counts < 100].index)

In [17]:
# Assign through .loc in a single step; chained indexing (data_m['Country'][mask_1] = ...) may write to a temporary copy
data_m.loc[mask_1, 'Country'] = 'OTHER'


In [18]:
data_m['Country'].unique()

array(['US', 'GB', 'OTHER', 'GH', 'CA', 'AU', 'MX', 'IN', 'DE', 'NL'],
      dtype=object)
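
The three steps above (count the labels, build a mask, overwrite the rare ones) can be wrapped in a small helper so that the same threshold logic could be reused on other high-cardinality columns such as `State` or `City`. This is a minimal sketch: the function name `group_rare_categories` and the toy series are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def group_rare_categories(series, min_count, other_label="OTHER"):
    """Return a copy of `series` where labels seen fewer than `min_count` times are replaced.

    Hypothetical helper for illustration only.
    """
    counts = series.value_counts()
    rare_labels = counts[counts < min_count].index
    # Series.where keeps a value where the condition is True and substitutes other_label elsewhere.
    return series.where(~series.isin(rare_labels), other_label)


# Toy data (made up): 'US' appears three times, 'GB' twice, 'FR' once.
toy = pd.Series(["US", "US", "US", "GB", "GB", "FR"])
grouped = group_rare_categories(toy, min_count=2)
print(grouped.tolist())
```

Because the helper returns a new series instead of writing into the frame, the result would be assigned back explicitly, for example `data_m['Country'] = group_rare_categories(data_m['Country'], 100)`.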

In [19]:
data_m['State'].nunique()

796

In [20]:
data_m['City'].nunique()

10444

#### 4.4.2.3 Extracting Month, Day & hour from Datetime column & dropping it<a id='4.4.2.3_Extracting_Month,_Day_&_hour_from_Datetime_column_&_dropping_it'></a>

In [21]:
data_m['Month'] = data_m['Date_time'].dt.month

In [22]:
data_m['Day'] = data_m['Date_time'].dt.day

In [23]:
data_m['Hour'] = data_m['Date_time'].dt.hour

In [24]:
data_m = data_m.drop(['Date_time'], axis = 1) 

In [25]:
data_m.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77544 entries, 0 to 77543
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Duration_minutes  77544 non-null  float64
 1   Country           77544 non-null  object 
 2   State             77544 non-null  object 
 3   City              77544 non-null  object 
 4   Year              77544 non-null  int64  
 5   Month             77544 non-null  int64  
 6   sh                77544 non-null  object 
 7   lat               77544 non-null  float64
 8   long              77544 non-null  float64
 9   Day               77544 non-null  int64  
 10  Hour              77544 non-null  int64  
dtypes: float64(3), int64(4), object(4)
memory usage: 6.5+ MB


### 4.4.3 Creating datasets for Modelling<a id='4.4.3_Creating_datasets_for_Modelling'></a>

#### 4.4.3.1 Dataset 1 All Features<a id='4.4.3.1_Dataset_1_All_Features'></a>

In [26]:
data_m.head()

| | Duration_minutes | Country | State | City | Year | Month | sh | lat | long | Day | Hour |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45.0 | US | Texas | San Marcos | 1949 | 10 | cylinder | 29.883056 | -97.941111 | 10 | 20 |
| 1 | 60.0 | US | Texas | Lackland Air Force Base | 1949 | 10 | light | 29.38421 | -98.581082 | 10 | 21 |
| 2 | 0.33 | GB | England | Blacon | 1955 | 10 | circle | 53.2 | -2.916667 | 10 | 17 |
| 3 | 30.0 | US | Texas | Edna | 1956 | 10 | circle | 28.978333 | -96.645833 | 10 | 21 |
| 4 | 15.0 | US | Hawaii | Kane'ohe | 1960 | 10 | light | 21.418056 | -157.803611 | 10 | 20 |


#### 4.4.3.2 Dataset 2 - without City & State Columns<a id='4.4.3.2_Dataset_2_-_without_City_&_State_Columns'></a>

In [27]:
data_1 = data_m.drop(['State', 'City'], axis = 1)

In [28]:
data_1.head()

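| | Duration_minutes | Country | Year | Month | sh | lat | long | Day | Hour |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 45.0 | US | 1949 | 10 | cylinder | 29.883056 | -97.941111 | 10 | 20 |
| 1 | 60.0 | US | 1949 | 10 | light | 29.38421 | -98.581082 | 10 | 21 |
| 2 | 0.33 | GB | 1955 | 10 | circle | 53.2 | -2.916667 | 10 | 17 |
| 3 | 30.0 | US | 1956 | 10 | circle | 28.978333 | -96.645833 | 10 | 21 |
| 4 | 15.0 | US | 1960 | 10 | light | 21.418056 | -157.803611 | 10 | 20 |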


In [29]:
data_1.shape

(77544, 9)

## 4.5 Encoding Categorical Features<a id='4.5_Encoding_Categorical_Features'></a>

### 4.5.1 For Dataset 1<a id='4.5.1_For_Dataset_1'></a>

In [30]:
# Using get_dummies method from pandas

X_1 = data_m[['Year','Month','sh','lat','long','Country','State','City','Day','Hour']]
y_1 = data_m['Duration_minutes']

In [31]:
X_1 = pd.get_dummies(data = X_1, drop_first = True)

### 4.5.2 For Dataset 2<a id='4.5.2_For_Dataset_2'></a>

In [32]:
X_2 = data_1[['Country','Year','Month','sh','lat','long','Day','Hour']]
y_2 = data_1['Duration_minutes']  # take the target from the same frame as X_2

In [33]:
X_2 = pd.get_dummies(data = X_2, drop_first = True)
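
To make the effect of `drop_first = True` concrete, the sketch below encodes a three-row toy frame (made-up values, not rows from the UFO data). Numeric columns pass through unchanged, each categorical column becomes one indicator column per label, and the first label in sorted order is dropped so that it acts as the reference level.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (illustrative values only).
toy = pd.DataFrame({
    "Year": [1999, 2004, 2010],
    "sh": ["circle", "light", "disk"],
})

encoded = pd.get_dummies(data=toy, drop_first=True)

# 'circle' sorts first, so it is dropped and becomes the implicit baseline:
# a row with sh_disk == 0 and sh_light == 0 is a 'circle' sighting.
print(encoded.columns.tolist())
```

Dropping one level per column avoids perfectly collinear indicator columns, which matters for linear models fitted with an intercept.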

## 4.6 Train/Test Split<a id='4.6_Train/Test_Split'></a>

In machine learning, if you train your model on all of your data, you have no data set aside to evaluate model performance. You could keep making more and more complex models that fit the data better and better without realising you were overfitting to that one set of samples. By partitioning the data into training and testing splits, and not letting a model (or missing-value imputation) learn anything from the test split, you get a somewhat independent assessment of how your model might perform in the future. An often overlooked subtlety is that people all too frequently use the test set to assess model performance _and then compare multiple models to pick the best_. This means the overall model selection process is itself fitted to one specific data set, now the test split. This is where cross-validation becomes especially useful: models are compared using only the training data, and the test split is kept as a final check on expected future performance.
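
The workflow described above can be sketched as follows: compare candidate models with cross-validation on the training split only, then touch the test split once for the chosen model. The data here is synthetic and the candidate names are assumptions made for illustration; `cross_validate` and `DummyRegressor` are the scikit-learn utilities already imported in this notebook.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=40
)

# Model comparison uses 5-fold cross-validation on the training split only.
candidates = {
    "mean_baseline": DummyRegressor(strategy="mean"),
    "linear": LinearRegression(),
}
cv_r2 = {}
for name, model in candidates.items():
    results = cross_validate(model, X_train, y_train, cv=5, scoring="r2")
    cv_r2[name] = results["test_score"].mean()

# The test split is used once, as a final check on the selected model.
best_name = max(cv_r2, key=cv_r2.get)
final_model = candidates[best_name].fit(X_train, y_train)
test_r2 = final_model.score(X_test, y_test)
print(best_name, cv_r2, test_r2)
```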

### 4.6.1 For Dataset 1<a id='4.6.1_For_Dataset_1'></a>

In [34]:
X_1_train, X_1_test, y_1_train, y_1_test = train_test_split(X_1, y_1, test_size = .25, random_state = 40)

### 4.6.2 For Dataset 2<a id='4.6.2_For_Dataset_2'></a>

In [35]:
X_2_train, X_2_test, y_2_train, y_2_test = train_test_split(X_2, y_2, test_size = .25, random_state = 40)

## 4.7  Metrics - User Defined Functions / SK Learn<a id='4.7_Metrics_-_User_Defined_Functions_/_SK_Learn'></a>

### 4.7.1 R-squared, or coefficient of determination<a id='4.7.1_R-squared,_or_coefficient_of_determination'></a>

One measure is $R^2$, the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination). This is a measure of the proportion of variance in the dependent variable (here, sighting duration in minutes) that is predicted by our "model". The linked Wikipedia article gives a nice explanation of how negative values can arise. This is frequently a cause of confusion for newcomers who, reasonably, ask how a squared value can be negative.

Recall the mean can be denoted by $\bar{y}$, where

$$\bar{y} = \frac{1}{n}\sum_{i=1}^ny_i$$

and where $y_i$ are the individual values of the dependent variable.

The total sum of squares (error), can be expressed as

$$SS_{tot} = \sum_i(y_i-\bar{y})^2$$

The above formula should be familiar as it's simply the variance without the denominator to scale (divide) by the sample size.

The residual sum of squares is similarly defined to be

$$SS_{res} = \sum_i(y_i-\hat{y}_i)^2$$

where $\hat{y}_i$ are our predicted values for the dependent variable.

The coefficient of determination, $R^2$, here is given by

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

Putting it into words, it's one minus the ratio of the residual variance to the original variance. Thus, the baseline model here, which always predicts $\bar{y}$, should give $R^2=0$. A model that perfectly predicts the observed values would have no residual error and so give $R^2=1$. Models that do worse than predicting the mean will have increased the sum of squares of residuals and so produce a negative $R^2$.
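
The three cases in the paragraph above can be checked on a tiny hand-made example. The four target values are arbitrary; for them $SS_{tot} = 500$, and reversing the order of the values as "predictions" gives $SS_{res} = 2000$, so $R^2 = 1 - 2000/500 = -3$.

```python
import numpy as np
from sklearn.metrics import r2_score

# Four arbitrary observed values (illustrative only); their mean is 25.
y = np.array([10.0, 20.0, 30.0, 40.0])

mean_pred = np.full_like(y, y.mean())  # baseline: always predict the mean
perfect_pred = y.copy()                # predictions equal to the observations
bad_pred = y[::-1].copy()              # worse than the mean: values in reverse order

r2_mean = r2_score(y, mean_pred)        # 0.0
r2_perfect = r2_score(y, perfect_pred)  # 1.0
r2_bad = r2_score(y, bad_pred)          # -3.0
print(r2_mean, r2_perfect, r2_bad)
```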

<b> User Defined </b>

In [None]:
# Calculate the R^2 as defined above
def r_squared(y, ypred):
    """R-squared score.
    
    Calculate the R-squared, or coefficient of determination, of the input.
    
    Arguments:
    y -- the observed values
    ypred -- the predicted values
    """
    ybar = np.sum(y) / len(y)  # yes, we could use np.mean(y)
    sum_sq_tot = np.sum((y - ybar)**2)  # total sum of squares
    sum_sq_res = np.sum((y - ypred)**2)  # residual sum of squares
    R2 = 1.0 - sum_sq_res / sum_sq_tot
    return R2

<b> SK Learn </b>

In [None]:
r2_score(y, y_pred)  # y: observed values, y_pred: predicted values

Although $R^2$ is a common metric, and interpretable in terms of the amount of variance explained, it is less appealing if you want an idea of how "close" your predictions are to the true values. Two metrics that summarise the difference between predicted and actual values are _mean absolute error_ and _mean squared error_.

### 4.7.2 Mean Absolute Error<a id='4.7.2_Mean_Absolute_Error'></a>

This is very simply the average of the absolute errors:

$$MAE = \frac{1}{n}\sum_{i=1}^n|y_i - \hat{y}_i|$$

<b> User Defined </b>

In [None]:
#Calculate the MAE as defined above
def mae(y, ypred):
    """Mean absolute error.
    
    Calculate the mean absolute error of the arguments

    Arguments:
    y -- the observed values
    ypred -- the predicted values
    """
    abs_error = np.abs(y - ypred)
    mae = np.mean(abs_error)
    return mae

<b> SK Learn </b>

In [None]:
mean_absolute_error(y, y_pred)

Mean absolute error is arguably the most intuitive of all the metrics: it tells you how far off, on average, you can expect a prediction to be, in the same units as the target (minutes of sighting duration here).

### 4.7.3 Mean Squared Error<a id='4.7.3_Mean_Squared_Error'></a>

Another common metric (and an important one internally for optimizing machine learning models) is the mean squared error. This is simply the average of the square of the errors:

$$MSE = \frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2$$

<b> User Defined </b>

In [None]:
#Calculate the MSE as defined above
def mse(y, ypred):
    """Mean square error.
    
    Calculate the mean square error of the arguments

    Arguments:
    y -- the observed values
    ypred -- the predicted values
    """
    sq_error = (y - ypred)**2
    mse = np.mean(sq_error)
    return mse

<b> SK Learn </b>

In [None]:
mean_squared_error(y, y_pred)
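
As a sanity check, the three formulas in this section can be evaluated directly with NumPy and compared with the scikit-learn functions. The observed and predicted values below are made up for illustration. The sketch also takes the square root of the MSE, which brings the error back to the units of the target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up observed and predicted durations (illustrative only).
y = np.array([5.0, 15.0, 30.0, 60.0])
y_pred = np.array([10.0, 10.0, 40.0, 50.0])

errors = y - y_pred  # [-5, 5, -10, 10]

mae_manual = np.mean(np.abs(errors))  # (5 + 5 + 10 + 10) / 4 = 7.5
mse_manual = np.mean(errors ** 2)     # (25 + 25 + 100 + 100) / 4 = 62.5
r2_manual = 1.0 - np.sum(errors ** 2) / np.sum((y - y.mean()) ** 2)
rmse_manual = np.sqrt(mse_manual)     # same units as y

print(mae_manual, mean_absolute_error(y, y_pred))
print(mse_manual, mean_squared_error(y, y_pred))
print(r2_manual, r2_score(y, y_pred))
```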

## 4.8 Scaling Data<a id='4.8_Scaling_Data'></a>

In [41]:
scaler = StandardScaler()
scaler.fit(X_2_train)
X_2_tr_scaled = scaler.transform(X_2_train)
X_2_te_scaled = scaler.transform(X_2_test)
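
The point of fitting the scaler on the training split only is that the test split contributes nothing to the learned means and standard deviations. A minimal sketch on synthetic data (the array shapes and distribution parameters are assumptions for illustration) shows the consequence: the scaled training columns have mean 0 and standard deviation 1, while the scaled test columns are only approximately standardised.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features centred far from zero (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(loc=50.0, scale=10.0, size=(400, 2))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=40)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_train)  # statistics learned from the training split
X_te_scaled = scaler.transform(X_test)       # the same statistics reused on the test split

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
print(X_te_scaled.mean(axis=0), X_te_scaled.std(axis=0))
```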

## 4.9 Training Model on Data<a id='4.9_Training_Model_on_Data'></a>

### 4.9.1 Linear Regression Model<a id='4.9.1_Linear_Regression_Model'></a>

#### 4.9.1.1 Train Model on Train Split without Scaling<a id='4.9.1.1_Train_Model_on_Train_Split_without_Scaling'></a>

In [36]:
lm = LinearRegression()
lm.fit(X_2_train, y_2_train)

LinearRegression()

In [37]:
y_2_tr_pred = lm.predict(X_2_train)
y_2_te_pred = lm.predict(X_2_test)

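In [58]:
# score() expects the observed targets; it returns R^2 between them and lm.predict(X_2_test)
lm.score(X_2_test, y_2_test)

0.00378264583083443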

In [38]:
r2_score(y_2_train, y_2_tr_pred), r2_score(y_2_test, y_2_te_pred)

(0.004243870520794779, 0.00378264583083443)

In [39]:
mean_absolute_error(y_2_train, y_2_tr_pred), mean_absolute_error(y_2_test, y_2_te_pred)

(18.75776326576067, 19.81914073070893)

#### 4.9.1.2 Train Model on Train Split with Scaling<a id='4.9.1.2_Train_Model_on_Train_Split_with_Scaling'></a>

In [42]:
lm_2 = LinearRegression()
lm_2.fit(X_2_tr_scaled, y_2_train)

LinearRegression()

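In [43]:
# Predict with lm_2, the model fitted on the scaled features.
# (lm was fitted on unscaled data, so feeding it scaled inputs gives meaningless predictions.)
y_2_tr_sc_pred = lm_2.predict(X_2_tr_scaled)
y_2_te_sc_pred = lm_2.predict(X_2_te_scaled)

In [47]:
# OLS with an intercept is unaffected by standardising the features,
# so these scores should match the unscaled model up to floating-point error.
r2_score(y_2_train, y_2_tr_sc_pred), r2_score(y_2_test, y_2_te_sc_pred)

In [48]:
mean_absolute_error(y_2_train, y_2_tr_sc_pred), mean_absolute_error(y_2_test, y_2_te_sc_pred)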
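
Bundling the scaler and the regressor in a single pipeline removes the risk of sending scaled inputs to a model that was fitted on unscaled data, or the reverse. The sketch below uses `make_pipeline` (already imported in this notebook) on synthetic data with features on very different scales. It also illustrates that, for ordinary least squares with an intercept, standardising the features should not change the predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features with very different locations and scales (illustrative only).
rng = np.random.default_rng(3)
X = rng.normal(loc=[0.0, 100.0, -50.0], scale=[1.0, 25.0, 5.0], size=(300, 3))
y = 3.0 * X[:, 0] + 0.1 * X[:, 1] - 2.0 * X[:, 2] + rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=40
)

plain = LinearRegression().fit(X_train, y_train)

# The pipeline fits the scaler on the training data and applies it
# automatically whenever predict() or score() is called.
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)

same_predictions = np.allclose(plain.predict(X_test), scaled.predict(X_test))
print(same_predictions, plain.score(X_test, y_test), scaled.score(X_test, y_test))
```

Scaling does matter for penalised models such as Lasso, where the penalty acts on coefficient size and therefore depends on the units of each feature.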

### 4.9.2 StatsModel OLS<a id='4.9.2_StatsModel_OLS'></a>

In [50]:
from statsmodels.formula.api import ols

fit = ols('Duration_minutes ~ C(Country) + Year + Month + C(sh) + lat + long + Day + Hour', data=data_1).fit() 


In [51]:
fit.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable | Duration_minutes | R-squared | 0.004 |
| Model | OLS | Adj. R-squared | 0.004 |
| Method | Least Squares | F-statistic | 8.185 |
| Date | Sun, 23 May 2021 | Prob (F-statistic) | 7.06e-49 |
| Time | 17:55:34 | Log-Likelihood | -442380.0 |
| No. Observations | 77544 | AIC | 884900.0 |
| Df Residuals | 77501 | BIC | 885200.0 |
| Df Model | 42 | | |
| Covariance Type | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 519.5124 | 51.782 | 10.033 | 0.000 | 418.020 | 621.005 |
| C(Country)[T.CA] | 1.8758 | 5.123 | 0.366 | 0.714 | -8.165 | 11.917 |
| C(Country)[T.DE] | 5.7447 | 7.820 | 0.735 | 0.463 | -9.581 | 21.071 |
| C(Country)[T.GB] | 0.6273 | 4.838 | 0.130 | 0.897 | -8.856 | 10.110 |
| C(Country)[T.GH] | 4.6511 | 4.143 | 1.123 | 0.262 | -3.469 | 12.772 |
| C(Country)[T.IN] | 7.1857 | 6.216 | 1.156 | 0.248 | -4.997 | 19.369 |
| C(Country)[T.MX] | 12.9614 | 6.786 | 1.910 | 0.056 | -0.339 | 26.262 |
| C(Country)[T.NL] | 0.8329 | 8.370 | 0.100 | 0.921 | -15.572 | 17.238 |
| C(Country)[T.OTHER] | 2.4902 | 4.139 | 0.602 | 0.547 | -5.623 | 10.603 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus | 207927.282 | Durbin-Watson | 1.987 |
| Prob(Omnibus) | 0.0 | Jarque-Bera (JB) | 8434588957.859 |
| Skew | 32.619 | Prob(JB) | 0.0 |
| Kurtosis | 1617.393 | Cond. No. | 398000.0 |


### 4.9.3 Lasso Regression<a id='4.9.3_Lasso_Regression'></a>

In [63]:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha = 1)
lasso.fit(X_2_train, y_2_train)

Lasso(alpha=1)

In [64]:
y_2_tr_lasso_pred = lasso.predict(X_2_train)
y_2_te_lasso_pred = lasso.predict(X_2_test)

In [65]:
r2_score(y_2_train, y_2_tr_lasso_pred), r2_score(y_2_test, y_2_te_lasso_pred)

(0.0015857632207292305, 0.0017492023121261635)

In [66]:
mean_absolute_error(y_2_train, y_2_tr_lasso_pred), mean_absolute_error(y_2_test, y_2_te_lasso_pred)

(18.868996274436256, 19.99531513815915)
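
The `alpha` parameter controls how strongly Lasso shrinks coefficients towards zero, and a large enough value removes every feature from the model. The sketch below counts non-zero coefficients for three values of `alpha` on synthetic data in which only the first two of six features carry signal. The data and the grid of `alpha` values are assumptions chosen for illustration, not a tuning recipe for the UFO data.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of six features influence the target (illustrative only).
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 6))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=500)

nonzero_by_alpha = {}
for alpha in [0.01, 1.0, 100.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    # Count how many coefficients survive the L1 penalty at this alpha.
    nonzero_by_alpha[alpha] = int(np.count_nonzero(model.coef_))

print(nonzero_by_alpha)
```

Because the penalty acts on raw coefficient sizes, the effect of a given `alpha` depends on the units of each feature. The notebook fits Lasso on unscaled features, so standardising them first (for instance inside a pipeline) and choosing `alpha` by cross-validation would be a natural next step.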