# Machine Learning for Phone Plans

# Contents <a id='back'></a>

* [Introduction](#introduction)
* [Data Exploration](#data_preprocessing)
    * [Conclusion](#conclusion)
* [Splitting the Source Data](#spliting_data)
    * [Conclusions](#splitting_data_conclusions)
* [Investigating the Quality of Different Models](#quality_check)
  * [Decision Tree Classifier](#decision_tree)
  * [Random Forest](#random_forest)
  * [Logistic Regression](#logical_regression)
  * [Conclusion](#conclusion)
* [Checking Model Quality](#checking)
* [Sanity Check](#sanity_check)
* [Findings](#conclusions)

## Introduction

The mobile carrier, Megaline, wants to develop a model that analyzes subscribers' behavior in order to recommend one of their newer plans: **Smart** or **Ultra**. We will use machine learning to develop a model for a classification task with the highest possible accuracy. 

**Data Description:**
* `calls` - number of calls
* `minutes` - total call duration in minutes
* `messages` - number of text messages
* `mb_used` - Internet traffic used in MB
* `is_ultra` - plan for the current month (Ultra - 1, Smart - 0)

**Objectives**:
* Split the source data into training set, validation set, and test set
* Investigate the quality of different models 
* Check the quality of the model using the test set
* Perform a sanity check on the model

## Initialization 

In [130]:
# Loading all libraries 

import pandas as pd 
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [94]:
# Loading the data
try:
  df = pd.read_csv('users_behavior.csv')
except FileNotFoundError:
  df = pd.read_csv('/datasets/users_behavior.csv')

## Data Exploration

In [95]:
# Obtaining the first 5 rows of the table
df.head()

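|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |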


In [96]:
# Obtaining the numbers of rows and columns 
shape = df.shape
print('The table has {} rows and {} columns'.format(shape[0], shape[1]))

The table has 3214 rows and 5 columns


In [97]:
# Obtaining general info on the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


Every column has 3,214 non-null values out of 3,214 entries, so there are no missing values in the data.
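The non-null counts from `df.info()` support this, and counting missing values per column makes the check explicit. The sketch below uses a small hypothetical frame (`demo`, not the Megaline data) so that it runs on its own; on the real data the equivalent call would be `df.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing value,
# standing in for the real `df` so the example is self-contained.
demo = pd.DataFrame({
    "calls": [40.0, 85.0, np.nan],
    "minutes": [311.90, 516.75, 467.66],
    "is_ultra": [0, 0, 1],
})

# Count missing values per column; all zeros would mean nothing is missing.
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(demo.isna().any().any())
print("Any missing values:", has_missing)
```

On a frame without gaps, every entry of `missing_per_column` is zero and `has_missing` is `False`.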

In [98]:
# Checking for duplicates
df.duplicated().sum()

0

**Conclusion**

Our dataframe has 3,214 rows and 5 columns. There are two data types in this table: float and integer. The columns `calls`, `minutes`, `messages`, and `mb_used` hold float values, and the column `is_ultra` holds integer values. There are no missing values or duplicates.

Since the data has already been preprocessed, we will proceed by splitting the source data into a training set, a validation set, and a test set.

## Splitting the Source Data

In [99]:
# Splitting the data into 60% training, 20% validation, 20% test set

# First hold out 20% of all rows as the validation set
df_train, df_valid = train_test_split(df, test_size=0.20, random_state=1234)

# Then hold out 25% of the remaining 80% (20% of all rows) as the test set
df_train, df_test = train_test_split(df_train, test_size=0.25, random_state=1234)

In [100]:
# Checking to see if data was split properly
print('The training set contains {} rows'.format(df_train.shape[0]) + ', which represents 60% of the data')
print()
print('The validation set contains {} rows'.format(df_valid.shape[0]) + ', which represents 20% of the data')
print()
print('The test set contains {} rows'.format(df_test.shape[0]) + ', which represents 20% of the data')

The training set contains 1928 rows, which represents 60% of the data

The validation set contains 643 rows, which represents 20% of the data

The test set contains 643 rows, which represents 20% of the data


In [101]:
# Declare variables for features and target
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

# Checking size of each features and target set
print('Size of training features:', features_train.shape)
print('Size of training target:', target_train.shape)
print()
print('Size of validation features:', features_valid.shape)
print('Size of validation target:', target_valid.shape)
print()
print('Size of test features:', features_test.shape)
print('Size of test target:', target_test.shape)

Size of training features: (1928, 4)
Size of training target: (1928,)

Size of validation features: (643, 4)
Size of validation target: (643,)

Size of test features: (643, 4)
Size of test target: (643,)


**Conclusions**

We split the source data in a 3:1:1 ratio: 60% training set, 20% validation set, and 20% test set. 
* The training set contains 1928 rows
* The validation set contains 643 rows
* The test set contains 643 rows
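
The 60/20/20 proportions come from two successive splits: 20% of all rows, then 25% of the remaining 80%. The sketch below reproduces the row counts by hand. The helper `split_sizes` is hypothetical, and it assumes that the held-out part is rounded up and the remainder goes to training, which matches the counts printed above.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the held-out part is rounded up and the remainder
    is used for training.
    """
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

total_rows = 3214

# First split: hold out 20% of all rows for validation.
n_rest, n_valid = split_sizes(total_rows, 0.20)

# Second split: hold out 25% of the remaining 80% for testing,
# i.e. 0.25 * 0.80 = 0.20 of the original data.
n_train, n_test = split_sizes(n_rest, 0.25)

print(n_train, n_valid, n_test)
print(round(n_train / total_rows, 3),
      round(n_valid / total_rows, 3),
      round(n_test / total_rows, 3))
```

Under that rounding assumption, the three counts come out as 1928, 643, and 643 rows.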

We decided that the features of our sets would be every column except for `is_ultra` and that the target would be the `is_ultra` column.

We will now investigate the quality of each model. Since this is a classification task, the models we will use are a decision tree classifier, a random forest, and logistic regression.

## Investigating the Quality of Different Models

### Decision Tree Classifier

A key hyperparameter of the decision tree is `max_depth`. We will create a loop that iterates over different values of `max_depth` to find the value that gives the highest accuracy on the validation set.



In [102]:
# Creating a loop for max_depth from 1 to 49
highest_accuracy = 0
best_depth = 0
for depth in range(1, 50):
  model = DecisionTreeClassifier(random_state=1234, max_depth=depth)
  model.fit(features_train, target_train)
  predictions_valid = model.predict(features_valid)
  accuracy = accuracy_score(target_valid, predictions_valid)
  if accuracy > highest_accuracy:
    highest_accuracy = accuracy
    best_depth = depth
print('Max depth = ', best_depth, ':', highest_accuracy)


Max depth =  3 : 0.7884914463452566


We found that a `max_depth` of 3 gave the highest accuracy on the validation set: 78.8%.
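
The search loop updates the best value only when a candidate is strictly better, so when several depths tie, the smallest one is reported. The sketch below isolates that selection rule. The helper `pick_best` and the values in `demo_scores` are hypothetical and serve only to illustrate the tie-breaking; they are not results from this notebook.

```python
def pick_best(candidates, score_fn):
    """Return (best_candidate, best_score), keeping the first candidate on ties."""
    best_candidate, best_score = None, float("-inf")
    for candidate in candidates:
        score = score_fn(candidate)
        # Strict '>' means a later candidate with an equal score
        # does not replace the earlier one.
        if score > best_score:
            best_candidate, best_score = candidate, score
    return best_candidate, best_score

# Hypothetical validation accuracies per max_depth (illustrative only).
demo_scores = {1: 0.74, 2: 0.77, 3: 0.79, 4: 0.79, 5: 0.78}

best_depth, best_score = pick_best(sorted(demo_scores), demo_scores.get)
print(best_depth, best_score)
```

Preferring the smaller depth on a tie is a reasonable default, since a shallower tree is simpler and less prone to overfitting.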

### Random Forest 

The random forest algorithm trains a large number of independent trees and makes a decision by voting. We will tune the hyperparameter `n_estimators` to find the value that gives the highest accuracy on the validation set.
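
As a simplified picture of the voting step, the sketch below takes the class predicted by each tree for one subscriber and returns the most common label. The function `majority_vote` and the list `votes` are hypothetical. Note that, as far as the scikit-learn documentation describes it, `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by counting hard votes, so this illustrates the idea, not the library's exact procedure.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common class label among per-tree predictions."""
    counts = Counter(tree_predictions)
    # most_common(1) returns a one-element list holding the
    # (label, count) pair with the highest count.
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical predictions of five trees for one subscriber
# (1 = Ultra, 0 = Smart).
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))
```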

In [103]:
# Creating a loop for n_estimators from 1 to 49
highest_accuracy = 0
max_n_estimators = 0 
for n in range(1,50):
  model = RandomForestClassifier(random_state=1234, n_estimators=n)
  model.fit(features_train, target_train)
  predictions_valid = model.predict(features_valid)
  accuracy = accuracy_score(target_valid, predictions_valid)
  if accuracy > highest_accuracy:
    highest_accuracy = accuracy 
    max_n_estimators = n
print('n_estimators = ', max_n_estimators, ':', highest_accuracy)


n_estimators =  18 : 0.8149300155520995


We found that an `n_estimators` value of 18 gave the highest accuracy on the validation set: 81.5%.

### Logistic Regression

Logistic regression estimates probabilities by using a logistic function. We will be using the "liblinear" solver to find our accuracy rate. 
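
To make the role of the logistic function concrete, the sketch below computes a class-1 probability for one row as the logistic function of a weighted sum of the features. The weights and bias are hypothetical values chosen for illustration, not coefficients fitted in this notebook.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of class 1 for one row under a linear model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights and bias, not fitted values from this notebook.
weights = [0.01, 0.002, 0.005, 0.0001]
bias = -4.0

# First row of the table: calls, minutes, messages, mb_used.
row = [40.0, 311.9, 83.0, 19915.42]

probability = predict_proba(row, weights, bias)
# Predict Ultra (1) when the probability reaches 0.5, otherwise Smart (0).
predicted_class = int(probability >= 0.5)
print(round(probability, 3), predicted_class)
```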

In [104]:
# Finding accuracy for the logistic regression model
model = LogisticRegression(random_state=1234, solver='liblinear')
model.fit(features_train, target_train)
predictions_valid = model.predict(features_valid)
accuracy = accuracy_score(target_valid, predictions_valid)
print(f'Accuracy: {accuracy}')

Accuracy: 0.7293934681181959


The accuracy rate for the logistic regression model on the validation set is 72.9%.

**Conclusions**

For the decision tree classifier model, a `max_depth` of 3 gave the highest validation accuracy, 0.788. For the random forest model, an `n_estimators` value of 18 gave the highest validation accuracy, 0.815. Using the "liblinear" solver for the logistic regression model, we got a validation accuracy of 0.729. The best overall model for our data was the random forest model, with an accuracy rate of 81.5%.

We will now proceed with checking the quality of the random forest model by using the test set.

## Checking Model Quality 

In [105]:
random_forest_model = RandomForestClassifier(random_state=1234, n_estimators=18)
random_forest_model.fit(features_train, target_train)
test_prediction = random_forest_model.predict(features_test)
accuracy = accuracy_score(target_test, test_prediction)
print(f'Accuracy: {accuracy}')

Accuracy: 0.8009331259720062


The accuracy of the model is 80.1%. The model has passed the accuracy threshold of 75%. 

We will now sanity check the model. 

## Sanity Check 

In [124]:
# Splitting the training set again (default test_size=0.25) to get a hold-out subset
x_train, x_test, y_train, y_test = train_test_split(features_train, target_train, random_state=1234)


In [131]:
# Creating a Random Forest model
model = RandomForestClassifier(random_state=1234, n_estimators=18)
model.fit(x_train, y_train)

In [137]:
# Finding the accuracy of model
prediction = model.predict(x_test)
accuracy = accuracy_score(y_test, prediction)
print(f'Accuracy of model: {accuracy}')

Accuracy of model: 0.8070539419087137


In [143]:
# Creating a confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, prediction).ravel()
print(f'True negatives: {tn}')
print(f'False positives: {fp}')
print(f'False negatives: {fn}')
print(f'True positives: {tp}')

True negatives: 320
False positives: 26
False negatives: 67
True positives: 69


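Of the 346 subscribers whose actual plan is Smart (class 0), the model classified 320 correctly. Of the 136 subscribers whose actual plan is Ultra (class 1), it classified 69 correctly. Overall, 389 of the 482 predictions (80.7%) are correct.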
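
The four counts are enough to reproduce the headline metrics by hand. The sketch below assumes the order returned by `confusion_matrix(...).ravel()` for binary labels 0 and 1 (true negatives, false positives, false negatives, true positives), with Ultra (1) as the positive class.

```python
# Counts in the order returned by confusion_matrix(...).ravel()
# for labels 0 and 1.
tn, fp, fn, tp = 320, 26, 67, 69

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the rows predicted Ultra, the share that are Ultra
recall = tp / (tp + fn)      # of the rows that are Ultra, the share that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```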

In [135]:
# Confusion matrix metrics
matrix = classification_report(y_test, prediction)
print('Classification report: \n', matrix)

Classification report: 
               precision    recall  f1-score   support

           0       0.83      0.92      0.87       346
           1       0.73      0.51      0.60       136

    accuracy                           0.81       482
   macro avg       0.78      0.72      0.74       482
weighted avg       0.80      0.81      0.80       482



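For the Ultra plan (class 1), precision is 73%: when the model predicts Ultra, it is right 73% of the time. Recall (or sensitivity) is the percentage of actual positive values that are correctly classified. Recall for Ultra is 51%, so the model misses about half of the actual Ultra subscribers.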
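
A common way to frame a sanity check is to compare the model with a constant predictor that always outputs the majority class. The sketch below derives that baseline from the support column of the classification report (346 rows of class 0, 136 rows of class 1). The comparison is an illustrative addition, not a step that was run in this notebook.

```python
# Class supports taken from the classification report.
support = {0: 346, 1: 136}
model_accuracy = 0.8070539419087137  # hold-out accuracy printed earlier

total = sum(support.values())
majority_class = max(support, key=support.get)

# A predictor that always outputs the majority class is right
# for every row of that class and wrong for all others.
baseline_accuracy = support[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Model beats baseline: {model_accuracy > baseline_accuracy}")
```

If the model's accuracy were close to or below this baseline, it would be learning little beyond the class balance.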

## Findings

We split the source data in a 3:1:1 ratio: 60% training set, 20% validation set, and 20% test set.
* The training set contains 1928 rows.
* The validation set contains 643 rows.
* The test set contains 643 rows. 

We created three different models: decision tree classifier, random forest, and logistic regression. We investigated the quality of each model on the validation set, tuning hyperparameters where applicable.
* For the decision tree classifier model, we found that a max_depth of 3 gave us the highest accuracy rate. The accuracy rate was 78.8%.
* For the random forest model, we found that an n_estimators value of 18 gave us the highest accuracy rate. The accuracy rate was 81.5%.
* For the logistic regression model, we used the “liblinear” solver and obtained an accuracy rate of 72.9%.

The random forest classifier turned out to be the best model for our data. The logistic regression model was the least accurate model.

We checked the quality of the random forest model using the test set and got an accuracy rate of 80.1%. We also performed a sanity check for the model by calculating the precision, recall, f1-score, and accuracy.

Since the threshold for accuracy was 0.75 and our model obtained an accuracy of 80.1% on the test set, we can say that the model achieved what the task required. In other words, the model predicted the right phone plan for about 80% of the subscribers in the test set.
