# Model Quality and Improvements

## 1. Defining the Question

### a) Data Analysis Question

Can I develop a model that predicts the right plans for Megaline subscribers with good accuracy.

### b) Metric for Success

The machine learning model should predict subscribers' plan with an accuracy score of atleast 0.75

### c) Understanding the context 

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra

### d) Experimental Design

1. Data Importation
2. Data Exploration
3. Data Cleaning
4. Data Preparation
5. Data Modeling (Using Decision Trees, Random Forest and Logistic Regression)
6. Model Evaluation
7. Hyparameter Tuning
8. Findings and Recommendations

### e) Data Relevance

The datasets consists of features that are relevant enough to help in predicting the target variable, *is_ultra*. The  predictor features are: calls, minutes, messages and mb_used.

## 2. Reading the Data

In [None]:
# Importing our libraries
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
# Load the data below
# --- 
dataset_url = "https://raw.githubusercontent.com/wambasisamuel/DE_Week04_Wednesday/main/users_behavior.csv"
df = pd.read_csv(dataset_url) 

In [None]:
# Checking the first 5 rows of data
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [None]:
# Checking the last 5 rows of data
# ---
df.tail(5)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
3209,122.0,910.98,20.0,35124.9,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0
3213,80.0,566.09,6.0,29480.52,1


In [None]:
# Checking number of rows and columns
df.shape

(3214, 5)

In [None]:
# Checking datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


Observations:

*   The are 3214 observations in the dataset.
*   The dataset has 5 features.
*   All the features are numerical.



## 3. External Data Source Validation

The provided dataset has features typical to those that can be found in any typical Telco.

## 4. Data Preparation

### Data Standardisation

In [None]:
# Standardise column names
df.columns

# Column names are already in standard form. No further action needed

Index(['calls', 'minutes', 'messages', 'mb_used', 'is_ultra'], dtype='object')

### Data Cleaning

#### Missing Data

In [None]:
# Checking missing entries of all the variables
# ---
# 
df.isnull().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

There are no missing values in the dataset.

#### Duplicate data

In [None]:
# Find the total duplicate records
df.duplicated().sum()

0

There are no duplicate records.

## 5. Data Modelling

The target variable has only two values 0 and 1. **This is a binary classification problem**. For this reason I will develop and tune classification models

In [None]:
df['is_ultra'].unique()

array([0, 1])

### Splitting the dataset

I will split the dataset in the ratio 3:1:1 corresponding to training,testing and validation datasets

In [None]:
df_tr, df_test = train_test_split(df, test_size=0.2, random_state=12345)
df_train, df_valid = train_test_split(df_tr, test_size=0.25, random_state=12345)


# features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.20, random_state=12345)
# features_train, features_valid, target_train, target_valid = train_test_split(features_train, target_train, test_size=0.2, random_state=12345 )

# Training, Test and Validation features and targets
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']

features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']

### Decision Tree Modelling

In [None]:
dt_model = DecisionTreeClassifier(random_state=12345)

# Train the model
dt_model.fit(features_train, target_train)

# Get model score
dt_model_score = dt_model.score(features_valid, target_valid)
dt_model_score

0.7122861586314152

### Random Forest Modelling

In [None]:
rf_model = RandomForestClassifier(random_state=12345)

# Train the model
rf_model.fit(features_train, target_train)

# Get model score
rf_model_score = rf_model.score(features_valid, target_valid)
rf_model_score

0.7947122861586314

### Logic Regression Modelling

In [None]:
lr_model = LogisticRegression(random_state=12345, solver='liblinear')

# Train the model
lr_model.fit(features_train, target_train)

# Get model score
lr_model_score = lr_model.score(features_valid, target_valid)
lr_model_score

0.7293934681181959

## 6. Model Evaluation

* Random forest model has the highest accuracy of 0.7947.

## 7. Hyperparameter Tuning

### Decison Tree

I will tune the max_depth hyperparameter

In [None]:
from sklearn.metrics import accuracy_score
for depth in range(1, 15):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    print('max_depth =', depth, ': ', end='')
    print(model.score(features_valid, target_valid))
    #print(accuracy_score(target_valid, predictions_valid))

max_depth = 1 : 0.7387247278382582
max_depth = 2 : 0.7573872472783826
max_depth = 3 : 0.7651632970451011
max_depth = 4 : 0.7636080870917574
max_depth = 5 : 0.7589424572317263
max_depth = 6 : 0.7573872472783826
max_depth = 7 : 0.7744945567651633
max_depth = 8 : 0.7667185069984448
max_depth = 9 : 0.7620528771384136
max_depth = 10 : 0.7713841368584758
max_depth = 11 : 0.7589424572317263
max_depth = 12 : 0.7558320373250389
max_depth = 13 : 0.749611197511664
max_depth = 14 : 0.7573872472783826


The best result is 0.7745 when the depth is 7

### Random Forest

I will tune the n_estimators parameter

In [None]:
for estimator in range(1, 101, 5):
    model = RandomForestClassifier(random_state=12345, n_estimators=estimator)
    model.fit(features_train, target_train)
    valid_score = model.score(features_valid, target_valid)
    print('estimators =', estimator, ': ', valid_score)

estimators = 1 :  0.702954898911353
estimators = 6 :  0.7698289269051322
estimators = 11 :  0.7807153965785381
estimators = 16 :  0.7838258164852255
estimators = 21 :  0.7884914463452566
estimators = 26 :  0.7838258164852255
estimators = 31 :  0.7869362363919129
estimators = 36 :  0.7884914463452566
estimators = 41 :  0.7869362363919129
estimators = 46 :  0.7947122861586314
estimators = 51 :  0.7916018662519441
estimators = 56 :  0.7931570762052877
estimators = 61 :  0.7931570762052877
estimators = 66 :  0.7978227060653188
estimators = 71 :  0.7962674961119751
estimators = 76 :  0.7978227060653188
estimators = 81 :  0.7962674961119751
estimators = 86 :  0.7978227060653188
estimators = 91 :  0.7947122861586314
estimators = 96 :  0.7931570762052877


Highest accuracy is obtained when estimators are between 60 and 70.

In [None]:
for estimator in range(60, 70):
    model = RandomForestClassifier(random_state=12345, n_estimators=estimator)
    model.fit(features_train, target_train)
    valid_score = model.score(features_valid, target_valid)
    print('estimators =', estimator, ': ', valid_score)

estimators = 60 :  0.7962674961119751
estimators = 61 :  0.7931570762052877
estimators = 62 :  0.7962674961119751
estimators = 63 :  0.7947122861586314
estimators = 64 :  0.7978227060653188
estimators = 65 :  0.7993779160186625
estimators = 66 :  0.7978227060653188
estimators = 67 :  0.7978227060653188
estimators = 68 :  0.7993779160186625
estimators = 69 :  0.7993779160186625


The best result is 0.7994 when the estimators are 65.

## 7. Model Quality

The Random Forest model has the best performance hence I will do a sanity check on it using the test data set.

### Test dataset

In [None]:
model = RandomForestClassifier(n_estimators=65, random_state=12345)
model.fit(features_valid, target_valid)

test_score = model.score(features_test, target_test)
test_score

0.7791601866251944

### Sanity Check

In [None]:
# Random Sampling
(df['is_ultra']==0).sum() / df.shape[0]

0.693528313627878

The model performs better than random sampling.

## 8. Summary and Recommendations

Below are the findings:

1. The best model is Random Forest model.  
2. The model performs better than random sampling.



