
I start by mounting my Google Drive in the Colab runtime.

In [1]:
from google.colab import drive
from tensorflow import keras
import numpy as np

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Then I read the CSV file into a pandas DataFrame.

In [2]:
import pandas as pd

dataset = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/DataSets/responses.csv')

Now I can check the info of the dataset to see which data types we are working with and what changes we may need to make.

In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010 entries, 0 to 1009
Columns: 150 entries, Music to House - block of flats
dtypes: float64(134), int64(5), object(11)
memory usage: 1.2+ MB


I drop the first 100 columns, since I do not want to work with such a large number of columns. This leaves 50.

In [4]:
# Drop the first 100 columns by position.
dataset.drop(columns=dataset.columns[:100], inplace=True)
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010 entries, 0 to 1009
Data columns (total 50 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Cheating in school              1006 non-null   float64
 1   Health                          1009 non-null   float64
 2   Changing the past               1008 non-null   float64
 3   God                             1008 non-null   float64
 4   Dreams                          1010 non-null   int64  
 5   Charity                         1007 non-null   float64
 6   Number of friends               1010 non-null   int64  
 7   Punctuality                     1008 non-null   object 
 8   Lying                           1008 non-null   object 
 9   Waiting                         1007 non-null   float64
 10  New environment                 1008 non-null   float64
 11  Mood swings                     1006 non-null   float64
 12  Appearence and gestures         10

Seeing what I am left with, I decided to merge some of the columns into a single "Happiness Index" and to train the model to predict it.

In [5]:
dataset["Happiness Index"] = (dataset["Happiness in life"] + dataset["Life struggles"] + dataset["Energy levels"] + dataset["Dreams"] + dataset["Health"]) / 5

# Drop the five source columns now that they are merged into the index.
dataset = dataset.drop(columns=["Happiness in life", "Life struggles", "Energy levels", "Dreams", "Health"])
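
A small sketch of how this kind of row-wise average behaves when an answer is missing. The three-row DataFrame is made up for illustration and is not taken from responses.csv. Adding the columns with `+` gives NaN as soon as one component is missing, whereas `mean(axis=1)` averages whatever answers are present.

```python
import numpy as np
import pandas as pd

# Hypothetical toy answers; not real survey data.
toy_answers = pd.DataFrame({
    "Happiness in life": [4.0, 3.0, np.nan],
    "Energy levels": [5.0, 2.0, 4.0],
    "Health": [3.0, 4.0, 2.0],
})

# Same approach as the cell above: add the columns and divide by their count.
index_by_sum = (toy_answers["Happiness in life"]
                + toy_answers["Energy levels"]
                + toy_answers["Health"]) / 3

# Alternative: row-wise mean, which skips missing values by default.
index_by_mean = toy_answers.mean(axis=1)

print(index_by_sum.tolist())
print(index_by_mean.tolist())
```

With the `+` approach, a row with any missing component gets a NaN index, so it disappears once null values are dropped later on.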

Now I can analyze how strongly the other columns correlate with the Happiness Index and decide which ones I want to keep.

In [6]:
corr_matrix = dataset.corr(numeric_only=True)
corr_matrix["Happiness Index"].sort_values(ascending=False)

Happiness Index                   1.000000
Parents' advice                   0.300817
Number of friends                 0.272518
Children                          0.259362
Interests or hobbies              0.239095
Spending on looks                 0.235009
Appearence and gestures           0.220655
Personality                       0.208418
Spending on healthy eating        0.196633
Shopping centres                  0.196179
Knowing the right people          0.193044
God                               0.189612
Socializing                       0.146122
Questionnaires or polls           0.121397
Charity                           0.112670
Getting angry                     0.108623
Assertiveness                     0.106802
Finding lost valuables            0.104353
New environment                   0.089235
Achievements                      0.075125
Cheating in school                0.062730
Unpopularity                      0.059191
Branded clothing                  0.055154
Number of s

In [7]:
dataset = dataset.drop(columns=['Weight', 'Height', 'Changing the past', 'Small - big dogs', 'Waiting'])

Having removed the columns with the least weight on the Happiness Index, I am left with the most important data for my model.

In [8]:
corr_matrix = dataset.corr(numeric_only=True)
corr_matrix["Happiness Index"].sort_values(ascending=False)

Happiness Index                   1.000000
Parents' advice                   0.300817
Number of friends                 0.272518
Children                          0.259362
Interests or hobbies              0.239095
Spending on looks                 0.235009
Appearence and gestures           0.220655
Personality                       0.208418
Spending on healthy eating        0.196633
Shopping centres                  0.196179
Knowing the right people          0.193044
God                               0.189612
Socializing                       0.146122
Questionnaires or polls           0.121397
Charity                           0.112670
Getting angry                     0.108623
Assertiveness                     0.106802
Finding lost valuables            0.104353
New environment                   0.089235
Achievements                      0.075125
Cheating in school                0.062730
Unpopularity                      0.059191
Branded clothing                  0.055154
Number of s
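
The columns to drop were picked by reading the sorted list. As a sketch of how the same selection could be done programmatically, the example below ranks features by the absolute value of their correlation, so that strong negative correlations are not mistaken for weak ones. The toy data, the column names and the 0.3 threshold are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
toy_corr_data = pd.DataFrame({
    "target":  [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "strong":  [1.1, 1.9, 3.2, 3.8, 5.1, 6.0],  # rises with the target
    "inverse": [6.0, 5.0, 4.2, 3.1, 1.9, 1.0],  # falls as the target rises
    "noise":   [1.0, 2.0, 2.0, 2.0, 2.0, 1.0],  # unrelated to the target
})

# Correlation of every feature with the target, excluding the target itself.
toy_corr = toy_corr_data.corr()["target"].drop("target")

# Rank by strength, ignoring the sign.
ranked = toy_corr.abs().sort_values(ascending=False)

# Features whose absolute correlation falls below the (assumed) threshold.
threshold = 0.3
weak_columns = toy_corr[toy_corr.abs() < threshold].index.tolist()

print(ranked)
print(weak_columns)
```

Sorting the signed values, as in the cells above, places strongly negative correlations at the bottom of the list, next to the ones close to zero.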

I continue my data preparation by dropping every row that contains a null value.

In [9]:
from sklearn.model_selection import train_test_split

dataset = dataset.dropna()
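
Before dropping rows it can be useful to check how many would be lost. A minimal sketch on a made-up four-row DataFrame (the column names are borrowed from the dataset, the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing number and one missing category.
toy_nulls = pd.DataFrame({
    "Charity": [2.0, np.nan, 4.0, 3.0],
    "Lying": ["never", "sometimes", None, "sometimes"],
})

rows_before = len(toy_nulls)
# A row counts as incomplete if any of its cells is null.
rows_with_nulls = int(toy_nulls.isna().any(axis=1).sum())
toy_clean = toy_nulls.dropna()

print(rows_before, rows_with_nulls, len(toy_clean))
```

Because the incomplete rows are removed at this point, imputation steps applied afterwards to the same data have nothing left to fill; they only matter if this `dropna()` call is skipped.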

I split my dataset into the predictors and the labels, that is, the Happiness Index values the model should predict.

In [10]:
# Remove the labels from the Dataset.
happiness_predictors = dataset.drop(columns="Happiness Index")
# Keep the labels in a separate set.
happiness_labels = dataset["Happiness Index"].copy()

Now I make two pipelines, one for the numerical and one for the categorical data, to transform them into data that the model can use: numerical values are scaled to the range between zero and one, and categories are one-hot encoded.

In [11]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), MinMaxScaler())

In [12]:
from sklearn.preprocessing import OneHotEncoder

# Pipeline for the categorical attribute.
cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder(sparse_output=False))
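
To make the effect of the two pipelines concrete, here is a sketch that runs equivalent pipelines on two tiny made-up columns. The values are hypothetical, and the sketch calls `.toarray()` instead of passing `sparse_output=False` only so that it also runs on older scikit-learn versions.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy columns; not real survey data.
toy_num = pd.DataFrame({"Number of friends": [1.0, 3.0, 5.0]})
toy_cat = pd.DataFrame({"Lying": ["never", "sometimes", "never"]})

# Numerical pipeline: fill gaps with the median, then rescale to [0, 1].
toy_num_pipeline = make_pipeline(SimpleImputer(strategy="median"), MinMaxScaler())
toy_num_out = toy_num_pipeline.fit_transform(toy_num)

# Categorical pipeline: fill gaps with the most frequent value, then one-hot encode.
toy_cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"), OneHotEncoder())
toy_cat_out = toy_cat_pipeline.fit_transform(toy_cat).toarray()

print(toy_num_out.ravel())  # smallest value maps to 0, largest to 1
print(toy_cat_out)          # one column per category, in sorted order
```

Each categorical column expands into as many output columns as it has distinct values, which is why the prepared data ends up wider than the original predictors.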

Now I can combine the two pipelines into one preprocessing pipeline that automatically transforms all columns into usable data.

In [13]:
# Pipeline that will transform both the numerical and categorical attributes and combine them.

happiness_num = happiness_predictors.select_dtypes(include=[np.number])
happiness_categories = happiness_predictors.select_dtypes(exclude='number')
# We must pass the names of the attributes which should be transformed
num_attribs = list(happiness_num)
cat_attribs = list(happiness_categories)

preprocessing_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs),
    ])

happiness_predictors_prepared = preprocessing_pipeline.fit_transform(happiness_predictors)
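
One thing to be aware of: calling `fit_transform` on all rows before splitting means the scaler's minimum and maximum are computed with the future test rows included. A common alternative, sketched below on a made-up single feature, is to split first, fit on the training part only, and then reuse the fitted transformer on the rest. This is an illustration under those assumptions, not a change to the notebook's flow.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature with the values 0..9.
toy_X = np.arange(10, dtype=float).reshape(-1, 1)
toy_y = np.arange(10, dtype=float)

toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(toy_X, toy_y, random_state=42)

toy_scaler = MinMaxScaler()
toy_tr_scaled = toy_scaler.fit_transform(toy_X_tr)  # statistics from training rows only
toy_te_scaled = toy_scaler.transform(toy_X_te)      # reuse them, do not refit

print(toy_tr_scaled.min(), toy_tr_scaled.max())
print(toy_te_scaled.ravel())
```

With this ordering the scaled test values may fall slightly outside the range from zero to one, which is expected: the scaler never saw them.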

Now I can split the prepared data into training, validation and test sets.

In [14]:
X_train_full, X_test, y_train_full, y_test = train_test_split(happiness_predictors_prepared, happiness_labels, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full, random_state=42)

X_train.shape

(508, 57)
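
Calling `train_test_split` twice with its default settings holds out 25% of the rows each time. The sketch below shows the resulting proportions on a hypothetical array of 1000 rows; the row count is an assumption chosen to make the arithmetic easy to follow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 rows, 3 features.
toy_features = np.zeros((1000, 3))
toy_targets = np.zeros(1000)

# First split: 25% of all rows become the test set.
toy_train_full, toy_test, toy_y_train_full, toy_y_test = train_test_split(
    toy_features, toy_targets, random_state=42)

# Second split: 25% of the remaining rows become the validation set.
toy_train, toy_valid, toy_y_train, toy_y_valid = train_test_split(
    toy_train_full, toy_y_train_full, random_state=42)

split_sizes = {"train": len(toy_train), "valid": len(toy_valid), "test": len(toy_test)}
print(split_sizes)
```

So roughly 56% of the rows end up in the training set, 19% in the validation set and 25% in the test set.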

Now I can start preparing my model as well. I am using an MLP (Multi-Layer Perceptron) with early stopping (to reduce overfitting) and learning rate scheduling (to avoid getting stuck on plateaus).

In [15]:
#Early stopping callback
early_stopping_cb = keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)

# Performance scheduling
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.1, patience=4)
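
To see how the two patience values interact, here is a simplified pure-Python re-implementation of the documented behaviour of both callbacks. It is a sketch, not Keras's own code: the function name, the made-up validation loss curve, and the starting learning rate of 0.01 are assumptions for illustration.

```python
def simulate_callbacks(val_losses, lr=0.01, factor=0.1, lr_patience=4,
                       stop_patience=20, min_delta=1e-4):
    """Return (epochs_run, final_lr, best_epoch) for a list of validation losses."""
    best_for_lr = float("inf")
    best_for_stop = float("inf")
    lr_wait = stop_wait = 0
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        # Plateau check: the loss must improve by more than min_delta.
        if loss < best_for_lr - min_delta:
            best_for_lr = loss
            lr_wait = 0
        else:
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr *= factor      # shrink the learning rate
                lr_wait = 0
        # Early stopping check: any improvement resets the counter.
        if loss < best_for_stop:
            best_for_stop = loss
            best_epoch = epoch
            stop_wait = 0
        else:
            stop_wait += 1
            if stop_wait >= stop_patience:
                return epoch, lr, best_epoch
    return len(val_losses), lr, best_epoch


# Hypothetical curve: improves for 10 epochs, then stays flat.
toy_val_losses = [1.0 - 0.05 * i for i in range(10)] + [0.6] * 90
epochs_run, final_lr, best_epoch = simulate_callbacks(toy_val_losses)
print(epochs_run, final_lr, best_epoch)
```

In this made-up run the learning rate is cut five times before early stopping ends training, so most of the 20 patience epochs are spent at very small learning rates.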

I continue by creating a Sequential model with multiple layers: one input layer, a few hidden layers and one output layer.

In [16]:
optimizer = keras.optimizers.SGD(momentum=0.9)

model = keras.models.Sequential([
    # input layer
    keras.layers.Input(shape=(57,)),
    # hidden layers
    keras.layers.Dense(70, activation="selu"),
    keras.layers.Dense(70, activation="selu"),
    keras.layers.Dense(70, activation="selu"),
    keras.layers.AlphaDropout(rate=0.1),
    # output layer
    keras.layers.Dense(1)
])

# Compile the model.
model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])
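
As a quick sanity check on the size of this network, the number of trainable parameters can be worked out by hand: each Dense layer has one weight per input-unit pair plus one bias per unit, and the dropout layer adds none. The helper name below is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


# Input width followed by the width of each Dense layer in the model above.
layer_sizes = [57, 70, 70, 70, 1]
params_per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(params_per_layer)

print(params_per_layer)
print(total_params)
```

The total can be compared with the output of `model.summary()`. With 508 training rows, that is about 28 parameters per row, which is one reason early stopping and dropout are useful here.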

Now I train the model on the data I prepared earlier, using the two callbacks.

In [17]:
# Train the model.
history = model.fit(X_train, y_train, epochs=100, validation_data=(X_valid, y_valid), callbacks=[lr_scheduler, early_stopping_cb])

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100


Now let's evaluate the results.

In [18]:
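loss, rmse = model.evaluate(X_test, y_test)



The second value returned by evaluate is the metric the model was compiled with, RootMeanSquaredError, not an accuracy. We get an RMSE of about 0.63 index points on the test set, which is not precise enough for the model to be actively used.

In [19]:
rmse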

0.6337803602218628

In [28]:
y_test.head()

75     3.2
495    3.6
236    4.2
270    3.4
41     3.2
Name: Happiness Index, dtype: float64

In [32]:
model.predict(X_test[:5])



array([[3.8346434],
       [3.9206762],
       [4.1353016],
       [3.9819272],
       [3.7375906]], dtype=float32)

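The metric of about 0.63 is a root mean squared error, not an accuracy of 60%. For the five test samples shown, the difference between the prediction and the actual value ranges from about 0.06 to 0.63, and for two of the five it is less than 0.5. The model is therefore still usable, but only with some disclaimers.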
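
The comparison of the five samples can be made explicit with a few lines of NumPy. The numbers are copied from the two outputs above; the variable names are only for illustration.

```python
import numpy as np

# Values copied from y_test.head() and model.predict(X_test[:5]) above.
sample_actual = np.array([3.2, 3.6, 4.2, 3.4, 3.2])
sample_predicted = np.array([3.8346434, 3.9206762, 4.1353016, 3.9819272, 3.7375906])

sample_errors = sample_predicted - sample_actual
sample_abs_errors = np.abs(sample_errors)

sample_mae = sample_abs_errors.mean()                # mean absolute error
sample_rmse = np.sqrt(np.mean(sample_errors ** 2))   # root mean squared error
within_half_point = int((sample_abs_errors < 0.5).sum())

print(sample_abs_errors.round(2))
print(round(float(sample_mae), 3), round(float(sample_rmse), 3), within_half_point)
```

Five samples are far too few to judge the model, but they show that the predictions sit in a narrow band while the actual values vary more.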