## DATA HACKERMAN FINAL PROJECT

### Part 4 _Model Creation_

This final task is to create a predictive model for the response variable `properties.sentiment`, using the remaining features in the data set.

- Use AutoGluon or your preferred algorithm.
- The data files attached should be used to create the model.  

This task is a blank canvas to work with. The only caveat is that you must be able to explain the methods and models you are using.

- What we would like to see from this task is your thoughts and decisions on training and testing a model. This includes, but is not limited to, aspects such as:
    - feature selection & creation
    - parameter tuning of the model
    - train / validation / test split
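
As an illustration of the train / validation / test item above, here is a minimal sketch of a three-way stratified split using scikit-learn. The function name `three_way_split`, the split fractions, the seed, and the toy data are all assumptions made for the example; they are not taken from the project code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(df, label, val_size=0.15, test_size=0.15, seed=42):
    """Split df into train/validation/test, stratified on the label column."""
    # First carve off the test set, keeping class proportions intact.
    train_val, test = train_test_split(
        df, test_size=test_size, stratify=df[label], random_state=seed
    )
    # Rescale so that val_size is a fraction of the full data set,
    # not of the remaining train+validation portion.
    rel_val = val_size / (1.0 - test_size)
    train, val = train_test_split(
        train_val, test_size=rel_val, stratify=train_val[label], random_state=seed
    )
    return train, val, test


# Toy data with three classes (hypothetical values, not the project data).
toy = pd.DataFrame(
    {"x": range(100), "properties.sentiment": [-1.0, 0.0, 1.0, -1.0] * 25}
)
train, val, test = three_way_split(toy, "properties.sentiment")
print(len(train), len(val), len(test))
```

Stratifying on the label keeps the class proportions similar across the three parts, which matters when the classes are imbalanced.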



In [1]:
import pandas as pd
import json
import os

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.options.display.max_colwidth = None
pd.set_option("display.float_format", lambda x: '%.2f' % x)

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

from data_ingestion.ingest import get_data
from parameters.params import combined_data_file_path, autogluon_params
from sklearn.model_selection import train_test_split
# from autogluon.tabular import TabularDataset, TabularPredictor
from model_building.build_model import autogluon_model_build

In [2]:
data = get_data(combined_data_file_path)
data.head(2)

| | author.properties.friends | author.properties.status_count | author.properties.verified | content.body | location.country | properties.platform | properties.sentiment | location.latitude | location.longitude |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1689 | 22566.0 | False | Can't believe I'm missing Love Island 😩 | GB | twitter | 1.0 | 51.57 | 0.46 |
| 1 | 114 | 1377.0 | False | Last tweet about future wedding..... if I actually want a wedding I actually need to find a guy XD we all know I'm a loner. unlovable | GB | twitter | 1.0 | 52.97 | -1.17 |


In [3]:
data.columns

Index(['author.properties.friends', 'author.properties.status_count',
       'author.properties.verified', 'content.body', 'location.country',
       'properties.platform', 'properties.sentiment', 'location.latitude',
       'location.longitude'],
      dtype='object')
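
Since `content.body` is free text, one possible form of feature creation is to derive simple numeric features from it. The sketch below is only an illustration: the helper `add_text_features` and the new column names (`body_char_count`, `body_word_count`, `body_mention_count`, `body_exclaim_count`) are hypothetical and do not appear in the project code.

```python
import pandas as pd


def add_text_features(df, text_col="content.body"):
    """Return a copy of df with a few simple text-derived features."""
    out = df.copy()
    # Treat missing text as an empty string so the counts below are 0.
    text = out[text_col].fillna("").astype(str)
    out["body_char_count"] = text.str.len()
    out["body_word_count"] = text.str.split().str.len()
    # Count @-mentions and exclamation marks as crude style signals.
    out["body_mention_count"] = text.str.count(r"@\w+")
    out["body_exclaim_count"] = text.str.count("!")
    return out


# Two tweets taken from the data shown in this notebook.
sample = pd.DataFrame(
    {
        "content.body": [
            "Can't believe I'm missing Love Island 😩",
            "@DrunkenOldQrow @FancyWeiss",
        ]
    }
)
features = add_text_features(sample)
print(features[["body_word_count", "body_mention_count"]])
```

Whether such features help is an empirical question; they would need to be compared against the model trained on the raw columns.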

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 9 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   author.properties.friends       3000 non-null   object 
 1   author.properties.status_count  2999 non-null   float64
 2   author.properties.verified      3000 non-null   object 
 3   content.body                    2999 non-null   object 
 4   location.country                2999 non-null   object 
 5   properties.platform             2998 non-null   object 
 6   properties.sentiment            2999 non-null   float64
 7   location.latitude               2999 non-null   float64
 8   location.longitude              2999 non-null   float64
dtypes: float64(4), object(5)
memory usage: 211.1+ KB
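
Note that `author.properties.friends` is reported as `object` rather than a numeric dtype. This typically happens when at least one value cannot be parsed as a number; the malformed row inspected in cell 6 is a likely cause. A minimal sketch of coercing such a column to numeric with `pd.to_numeric`, shown on a toy frame rather than the project data:

```python
import pandas as pd

# Toy column mixing numeric strings with one unparseable value.
toy = pd.DataFrame(
    {"author.properties.friends": ["1689", "114", "|| TELL ME YOUR NAME! XD"]}
)

# errors="coerce" turns unparseable entries into NaN instead of raising.
friends = pd.to_numeric(toy["author.properties.friends"], errors="coerce")
print(friends.isna().sum())  # number of values that could not be parsed

toy["author.properties.friends"] = friends
print(toy.dtypes)
```

Counting the coerced `NaN` values first makes it clear how many rows are affected before deciding whether to drop or repair them.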


In [5]:
data.shape

(3000, 9)

During data exploration, we observed a row with a `NaN` value for the `"properties.platform"` feature at index `1551`. One option is to replace this value with `twitter`, which is the value for all other rows in the dataset. The row at index `1552` has numerous missing values and is dropped. Note that the `dropna()` call in the cells below removes both rows, which is why the shape goes from 3000 to 2998 rows.
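
The imputation described above (fill the missing platform with `twitter`, drop the largely empty row) could be sketched as follows. This is an illustration on a toy frame with assumed values; the helper name `clean` is hypothetical, and the notebook itself uses a plain `dropna()`.

```python
import numpy as np
import pandas as pd

PLATFORM = "properties.platform"
LABEL = "properties.sentiment"

# Toy frame mimicking the two problem rows (values are illustrative).
toy = pd.DataFrame(
    {
        "content.body": ["some text", "@DrunkenOldQrow @FancyWeiss", np.nan],
        PLATFORM: ["twitter", np.nan, np.nan],
        LABEL: [-1.0, 1.0, np.nan],
    },
    index=[1550, 1551, 1552],
)


def clean(df):
    """Impute the platform where possible, then drop incomplete rows."""
    # Rows without a label cannot be used for supervised training.
    out = df.dropna(subset=[LABEL])
    # Fill only the platform column; other gaps are left for dropna().
    out = out.fillna({PLATFORM: "twitter"})
    return out.dropna()


cleaned = clean(toy)
print(cleaned)
```

Compared with a blanket `dropna()`, this keeps the row whose only gap is the platform value.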

In [6]:
data[data["properties.platform"].isnull()]

| | author.properties.friends | author.properties.status_count | author.properties.verified | content.body | location.country | properties.platform | properties.sentiment | location.latitude | location.longitude |
|---|---|---|---|---|---|---|---|---|---|
| 1551 | 854 | 3688.0 | False | @DrunkenOldQrow @FancyWeiss | GB | NaN | 1.0 | 52.05 | -2.7 |
| 1552 | \|\| TELL ME YOUR NAME! XD | NaN | twitter | NaN | NaN | NaN | NaN | NaN | NaN |


In [7]:
data = data.dropna()

In [8]:
data.shape

(2998, 9)

In [9]:
data["properties.sentiment"].value_counts()

-1.00    1403
0.00      968
1.00      627
Name: properties.sentiment, dtype: int64
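
The counts above show that the three sentiment values are not evenly distributed. If the task is treated as classification (an assumption; the notebook does not state the problem type), a useful reference point is the accuracy of always predicting the most frequent class. A small sketch using the counts printed above:

```python
# Class counts copied from the value_counts() output above.
counts = {-1.0: 1403, 0.0: 968, 1.0: 627}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A model that always predicts the most frequent class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label:+.0f}: {share:.3f}")
print(f"majority class: {majority_label}, baseline accuracy: {baseline_accuracy:.3f}")
```

Any trained model should be compared against this baseline before its score is considered meaningful.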

In [10]:
# Model Building

train_data, test_data, predictor = (
    autogluon_model_build(data, autogluon_params)
)

Beginning AutoGluon training ... Time limit = 240s
AutoGluon will save models to "artifacts/models/"
AutoGluon Version:  0.7.0
Python Version:     3.8.16
Operating System:   Darwin
Platform Machine:   x86_64
Platform Version:   Darwin Kernel Version 22.3.0: Mon Jan 30 20:42:11 PST 2023; root:xnu-8792.81.3~2/RELEASE_X86_64
Train Data Rows:    2008
Train Data Columns: 8
Label Column: properties.sentiment
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    5300.09 MB
	Train Data (Original)  Memory Usage: 0.93 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generator
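
The helper `autogluon_model_build` and the contents of `autogluon_params` are not shown in this notebook. The sketch below is a hypothetical stand-in, not the project's implementation. It assumes a simple random hold-out of about 33% (which would be consistent with the 2008 training rows out of 2998 reported in the log), and it assumes parameter keys `test_size`, `path`, and `time_limit`. The label, the 240 s time limit, and the `artifacts/models/` path are taken from the log above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

LABEL = "properties.sentiment"


def split_data(data, test_size=0.33, seed=42):
    """Hold out a test set; test_size and seed are assumptions."""
    return train_test_split(data, test_size=test_size, random_state=seed)


def autogluon_model_build_sketch(data, params):
    """Hypothetical stand-in for the project's autogluon_model_build helper."""
    # Imported lazily so split_data can be used without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    train_data, test_data = split_data(data, params.get("test_size", 0.33))
    predictor = TabularPredictor(
        label=LABEL, path=params.get("path", "artifacts/models/")
    )
    predictor.fit(train_data, time_limit=params.get("time_limit", 240))
    return train_data, test_data, predictor


# Check the split sizes on a dummy frame with the same number of rows.
dummy = pd.DataFrame({"x": range(2998), LABEL: [0.0] * 2998})
train_part, test_part = split_data(dummy)
print(len(train_part), len(test_part))
```

Keeping the split in its own function makes it easy to swap in a stratified or three-way split later without touching the training code.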

In [11]:
predictor.leaderboard(silent=True)

| | model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L2 | -0.72 | 0.13 | 5.07 | 0.0 | 0.16 | 2 | True | 8 |
| 1 | XGBoost | -0.72 | 0.01 | 0.7 | 0.01 | 0.7 | 1 | True | 5 |
| 2 | CatBoost | -0.73 | 0.01 | 1.4 | 0.01 | 1.4 | 1 | True | 3 |
| 3 | ExtraTrees | -0.74 | 0.04 | 0.72 | 0.04 | 0.72 | 1 | True | 4 |
| 4 | RandomForest | -0.74 | 0.04 | 1.28 | 0.04 | 1.28 | 1 | True | 2 |
| 5 | NeuralNetTorch | -0.77 | 0.02 | 1.67 | 0.02 | 1.67 | 1 | True | 7 |
| 6 | LinearModel | -0.79 | 0.05 | 0.42 | 0.05 | 0.42 | 1 | True | 6 |
| 7 | KNeighbors | -0.87 | 0.07 | 2.5 | 0.07 | 2.5 | 1 | True | 1 |


In [12]:
predictor.leaderboard(test_data, silent=True)

| | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L2 | -0.73 | -0.72 | 0.27 | 0.13 | 5.07 | 0.0 | 0.0 | 0.16 | 2 | True | 8 |
| 1 | CatBoost | -0.73 | -0.73 | 0.02 | 0.01 | 1.4 | 0.02 | 0.01 | 1.4 | 1 | True | 3 |
| 2 | XGBoost | -0.73 | -0.72 | 0.02 | 0.01 | 0.7 | 0.02 | 0.01 | 0.7 | 1 | True | 5 |
| 3 | ExtraTrees | -0.73 | -0.74 | 0.11 | 0.04 | 0.72 | 0.11 | 0.04 | 0.72 | 1 | True | 4 |
| 4 | RandomForest | -0.74 | -0.74 | 0.09 | 0.04 | 1.28 | 0.09 | 0.04 | 1.28 | 1 | True | 2 |
| 5 | LinearModel | -0.78 | -0.79 | 0.09 | 0.05 | 0.42 | 0.09 | 0.05 | 0.42 | 1 | True | 6 |
| 6 | NeuralNetTorch | -0.79 | -0.77 | 0.03 | 0.02 | 1.67 | 0.03 | 0.02 | 1.67 | 1 | True | 7 |
| 7 | KNeighbors | -0.85 | -0.87 | 0.02 | 0.07 | 2.5 | 0.02 | 0.07 | 2.5 | 1 | True | 1 |
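
AutoGluon reports leaderboard scores so that higher is better, which means error-type metrics appear negated; the metric used here is not stated in this excerpt. One way to read the two leaderboards together is to compare each model's validation and test score: a test score well below the validation score would suggest overfitting to the validation data. A small sketch using the rounded values from the table above:

```python
# (score_val, score_test) per model, copied from the leaderboard above.
scores = {
    "WeightedEnsemble_L2": (-0.72, -0.73),
    "CatBoost": (-0.73, -0.73),
    "XGBoost": (-0.72, -0.73),
    "ExtraTrees": (-0.74, -0.73),
    "RandomForest": (-0.74, -0.74),
    "LinearModel": (-0.79, -0.78),
    "NeuralNetTorch": (-0.77, -0.79),
    "KNeighbors": (-0.87, -0.85),
}

# Negative gap: the model scored worse on test than on validation.
gaps = {model: round(test - val, 2) for model, (val, test) in scores.items()}

for model, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{model:<20} {gap:+.2f}")

largest_gap = max(abs(gap) for gap in gaps.values())
print(f"largest absolute gap: {largest_gap:.2f}")
```

Because the displayed scores are rounded to two decimals, gaps of this size are close to the rounding error and should be read as roughly equal validation and test performance.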
