# Machine Learning Project - Megaline

## Introduction

In this project, I will build a classification model for Megaline that recommends one of the newer plans to users who still use their legacy plans. The purpose of the project is to put my newly learned Machine Learning skills to the test.

Since the data was already pre-processed in the previous chapters, I will go straight to creating the model.

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [2]:
# Loading the dataset
df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
df.head(10)

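|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| 5 | 58.0 | 344.56 | 21.0 | 15823.37 | 0 |
| 6 | 57.0 | 431.64 | 20.0 | 3738.9 | 1 |
| 7 | 15.0 | 132.4 | 6.0 | 21911.6 | 0 |
| 8 | 7.0 | 43.39 | 3.0 | 2538.67 | 1 |
| 9 | 90.0 | 665.41 | 38.0 | 17358.61 | 0 |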


## Creating and training the models

Now I'm going to try out different models and find the best one for this dataset.

In [5]:
# Creating features and target
features = df.drop(['is_ultra'], axis=1)
target = df.is_ultra

In [6]:
# Splitting the datasets

# 70% for training; the remaining 30% is held out for validation and testing
features_train, remaining_features, target_train, remaining_target = train_test_split(
    features, target, test_size=0.3, random_state=321)

# Split the held-out 30% in half: 15% validation set, 15% test set
features_valid, features_test, target_valid, target_test = train_test_split(
    remaining_features, remaining_target, test_size=0.5, random_state=321)
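
As a quick sanity check on the two-step split above, here is a minimal sketch that confirms the 70/15/15 proportions. It runs on a synthetic stand-in table (the values are made up, only the column names match the dataset), and it adds `stratify=`, which the cell above does not use, to keep the share of `is_ultra` similar in every subset. Treat both of those as my assumptions, not as part of the original split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Megaline table (hypothetical values, same column names)
rng = np.random.default_rng(321)
n_rows = 1000
demo = pd.DataFrame({
    "calls": rng.integers(0, 150, n_rows).astype(float),
    "minutes": rng.uniform(0, 1000, n_rows),
    "messages": rng.integers(0, 100, n_rows).astype(float),
    "mb_used": rng.uniform(0, 40000, n_rows),
    "is_ultra": rng.choice([0, 1], size=n_rows, p=[0.7, 0.3]),
})

demo_features = demo.drop(columns=["is_ultra"])
demo_target = demo["is_ultra"]

# 70% train, then the remaining 30% split evenly into validation and test.
# stratify= is an addition here: it keeps the class share similar in each subset.
f_train, f_rest, t_train, t_rest = train_test_split(
    demo_features, demo_target, test_size=0.3, random_state=321, stratify=demo_target)
f_valid, f_test, t_valid, t_test = train_test_split(
    f_rest, t_rest, test_size=0.5, random_state=321, stratify=t_rest)

# Row counts and share of is_ultra == 1 per subset
split_sizes = {"train": len(f_train), "valid": len(f_valid), "test": len(f_test)}
class_shares = {name: float(t.mean()) for name, t in
                [("train", t_train), ("valid", t_valid), ("test", t_test)]}
print(split_sizes)
print(class_shares)
```

If the class shares differ a lot between subsets without `stratify=`, that alone can move the accuracy scores by a few points.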

In [7]:
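# Helper that returns the accuracy of an already-fitted model on the
# training, validation, and test sets (despite its name, it does not train)
def model_training(model):
    # Accuracy on the training set
    predictions_train = model.predict(features_train)
    predictions_acc_train = accuracy_score(target_train, predictions_train)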
    
    # Evaluating the model using validation set
    predictions_val = model.predict(features_valid)
    predictions_acc_val = accuracy_score(target_valid, predictions_val)
    
    # Evaluate on test data
    predictions_test = model.predict(features_test)
    predictions_acc_test = accuracy_score(target_test, predictions_test)
    
    return np.round([predictions_acc_train, predictions_acc_val, predictions_acc_test], 2)
    

In [8]:
# Trying the first model, a Decision Tree Classifier

print("Decision Tree Classifier")
# Cycling through possible depths
for depth in range(1, 6):
    model_dec_tree = DecisionTreeClassifier(random_state=321, max_depth=depth)

    # Fitting the model to training set
    model_dec_tree.fit(features_train, target_train)
    
    # Running the function to get accuracy score
    scores = model_training(model_dec_tree)
    
    print("Depth:", depth)
    print("Score Train:", scores[0])
    print("Score Valid:", scores[1])
    print("Score Test:", scores[2])
    print("")

Decision Tree Classifier
Depth: 1
Score Train: 0.76
Score Valid: 0.71
Score Test: 0.72

Depth: 2
Score Train: 0.79
Score Valid: 0.75
Score Test: 0.76

Depth: 3
Score Train: 0.81
Score Valid: 0.77
Score Test: 0.77

Depth: 4
Score Train: 0.81
Score Valid: 0.78
Score Test: 0.77

Depth: 5
Score Train: 0.82
Score Valid: 0.79
Score Test: 0.77



Using the Decision Tree Classifier, I was able to achieve accuracies of roughly 77-82% across the three sets, with a depth of 5 giving the best validation score (0.79). The training, validation, and test scores stay within about five percentage points of each other, so there is no strong sign of overfitting. The model passes the 75% accuracy threshold, but I still need to compare it with the other models.
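
The depth search can also be written so that the choice is made on the validation score alone and the test set is touched only once, for the selected tree. The sketch below shows that pattern on synthetic data from `make_classification` (a stand-in, not the Megaline data); the variable names are mine, and the printed numbers will not match the scores above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical), four features like the Megaline table
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=321)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.3, random_state=321)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=321)

best_depth, best_valid_acc, best_tree = None, -1.0, None
for depth in range(1, 6):
    tree = DecisionTreeClassifier(random_state=321, max_depth=depth)
    tree.fit(X_train, y_train)
    valid_acc = accuracy_score(y_valid, tree.predict(X_valid))
    # Keep the depth with the highest validation accuracy
    # (strict > means ties go to the shallower tree)
    if valid_acc > best_valid_acc:
        best_depth, best_valid_acc, best_tree = depth, valid_acc, tree

# The test set is used once, only for the selected model
test_acc = accuracy_score(y_test, best_tree.predict(X_test))
print("Best depth:", best_depth)
print("Validation accuracy:", round(best_valid_acc, 2))
print("Test accuracy:", round(test_acc, 2))
```

Keeping the test set out of the loop means the final test score is not influenced by the hyperparameter choice.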

In [9]:
# Creating second model of Random Forest Classifier
print("Random Forest Classifier")

# Getting scores for 1 to 10 trees to tune n_estimators
for est in range(1, 11):
    model_rand_for = RandomForestClassifier(random_state=321, n_estimators=est)
    
    # Fitting model to training set
    model_rand_for.fit(features_train, target_train)
    
    # Running the function to get accuracy score
    scores = model_training(model_rand_for)
    
    print("No. of Trees:", est)
    print("Score Train:", scores[0])
    print("Score Valid:", scores[1])
    print("Score Test:", scores[2])
    print("")

Random Forest Classifier
No. of Trees: 1
Score Train: 0.9
Score Valid: 0.72
Score Test: 0.72

No. of Trees: 2
Score Train: 0.91
Score Valid: 0.76
Score Test: 0.75

No. of Trees: 3
Score Train: 0.95
Score Valid: 0.74
Score Test: 0.73

No. of Trees: 4
Score Train: 0.95
Score Valid: 0.76
Score Test: 0.77

No. of Trees: 5
Score Train: 0.97
Score Valid: 0.76
Score Test: 0.77

No. of Trees: 6
Score Train: 0.97
Score Valid: 0.77
Score Test: 0.78

No. of Trees: 7
Score Train: 0.99
Score Valid: 0.76
Score Test: 0.77

No. of Trees: 8
Score Train: 0.97
Score Valid: 0.77
Score Test: 0.78

No. of Trees: 9
Score Train: 0.99
Score Valid: 0.76
Score Test: 0.78

No. of Trees: 10
Score Train: 0.98
Score Valid: 0.76
Score Test: 0.79



From the Random Forest model, we can see that the best number of trees is either 6 or 8: both give the highest validation score (0.77), with a test score of 0.78. The training score of 0.97 is much higher than either, which is a sign of overfitting, but the validation and test accuracy are still high enough to pass the minimum score.
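
One way to look at the overfitting is to measure the gap between training and validation accuracy, and to see how it changes when the trees are limited with `max_depth`. This is only a sketch on synthetic data, with a helper name (`train_valid_gap`) and a depth limit of 5 that I picked for illustration; I have not run it on the Megaline data, so it makes no claim about which setting would win there.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical); flip_y adds some label noise
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, flip_y=0.1, random_state=321)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=321)


def train_valid_gap(model):
    """Fit the model and return (train accuracy, validation accuracy, gap)."""
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    valid_acc = accuracy_score(y_valid, model.predict(X_valid))
    return train_acc, valid_acc, train_acc - valid_acc


# Same number of trees, with and without a limit on tree depth
candidates = {
    "unbounded depth": RandomForestClassifier(n_estimators=8, random_state=321),
    "max_depth=5": RandomForestClassifier(n_estimators=8, max_depth=5,
                                          random_state=321),
}

gaps = {}
for name, forest in candidates.items():
    train_acc, valid_acc, gap = train_valid_gap(forest)
    gaps[name] = gap
    print(f"{name}: train={train_acc:.2f} valid={valid_acc:.2f} gap={gap:.2f}")
```

A large gap with a similar validation score suggests the extra training accuracy comes from memorizing the training rows rather than from better generalization.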

In [10]:
# Creating 3rd model, Logistic Regression

# Initializing the model with the liblinear solver
model_log_reg = LogisticRegression(random_state=321, solver='liblinear')

# Fitting the model to training set
model_log_reg.fit(features_train, target_train)

predict_train = model_log_reg.predict(features_train)
predict_valid = model_log_reg.predict(features_valid)
predict_test = model_log_reg.predict(features_test)

train_acc = accuracy_score(target_train, predict_train)
valid_acc = accuracy_score(target_valid, predict_valid)
test_acc = accuracy_score(target_test, predict_test)

print("Logistic Regressor")
print("Score Train:", train_acc.round(2))
print("Score Valid:", valid_acc.round(2))
print("Score Test:", test_acc.round(2))

Logistic Regressor
Score Train: 0.71
Score Valid: 0.67
Score Test: 0.69


Sadly, the Logistic Regression model has low accuracy even on the training set (0.71). This could be because some features are irrelevant to the target, because some features contain quite a few outliers, or because the two classes cannot be separated well by a linear boundary on the raw features. Whatever the cause, this model does not work well for this dataset as it stands.
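
Another thing that may be worth trying before giving up on Logistic Regression is scaling the features, since the columns live on very different ranges (message counts versus megabytes). The sketch below compares the raw model with a `StandardScaler` pipeline on synthetic data whose columns I stretched by made-up factors; it shows the mechanics only, and scaling may or may not help on the real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical)
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=321)
# Stretch the columns to very different ranges (made-up factors),
# loosely mimicking counts versus megabytes
X = X * np.array([10.0, 100.0, 10.0, 5000.0])

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=321)

# Same classifier, once on raw features and once behind a scaler
raw_model = LogisticRegression(random_state=321, solver='liblinear')
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=321, solver='liblinear'),
)

results = {}
for name, model in [("raw", raw_model), ("scaled", scaled_model)]:
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_valid, model.predict(X_valid))
    print(f"{name}: validation accuracy = {results[name]:.2f}")
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, so no information from the validation set leaks into the scaling.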

## Conclusion

From all the models I've trained, I've concluded that the Decision Tree has the best results. At depth 5 it has the best validation accuracy (0.79), and its scores on all 3 sets are close to each other, which suggests that it generalizes better than the other models. The Random Forest model reaches similar validation and test accuracy, but its training score is far higher (0.97 at the best parameter), and that gap is a sign of overfitting: the model may be reading patterns into the training data that do not actually exist. The last model, Logistic Regression, has low accuracy on all sets and does not reach the 75% minimum accuracy mark, so it is not a good fit for our current data.
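
To put the 75% mark in context, it can help to compare each model against a baseline that always predicts the more common plan. Here is a minimal sketch using `DummyClassifier` on synthetic, imbalanced data; the 70/30 class weights are my assumption for the example and say nothing about the real class balance in the Megaline data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (hypothetical 70/30 class weights)
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.7, 0.3], random_state=321)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=321)

# Baseline that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

baseline_predictions = baseline.predict(X_valid)
baseline_acc = accuracy_score(y_valid, baseline_predictions)
majority_class = baseline_predictions[0]

print("Majority class:", majority_class)
print("Baseline validation accuracy:", round(baseline_acc, 2))
```

A model is only useful if it beats this number by a clear margin, because the baseline accuracy equals the share of the majority class in the validation set.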