# Introduction to Machine Learning Project 

In this project, my goal is to develop a model that analyzes the behavior of subscribers of the mobile carrier Megaline and recommends one of Megaline's newer plans: Smart or Ultra. I want to develop a model with the highest possible accuracy. The minimum acceptable accuracy for this project is 0.75.

To do that, I will go through the following stages:

1. Open and look through the data file.

2. Split the source data into a training set, a validation set, and a test set.

3. Investigate the quality of different models by changing hyperparameters.

4. Sanity check the model.

5. Check the accuracy using the test dataset.



In [16]:
# Loading all the libraries I will use:
import pandas as pd
import math 
from scipy import stats as st
import warnings
warnings.filterwarnings("ignore")
import requests 
import io

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


## Open and look through the data file


In [17]:
url = "https://raw.githubusercontent.com/yoav-karsenty/Machine-learning--accuracy-score/main/users_behavior%20(1).csv" 
download = requests.get(url).content

# Reading the downloaded content and turning it into a pandas dataframe

plan = pd.read_csv(io.StringIO(download.decode('utf-8')))

# Printing out the first 5 rows of the dataframe
plan.head()

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


In [8]:
plan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


There are no missing values and no column types to change, so we are good to go.
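
The same two checks can also be done programmatically instead of by reading the `info()` output. The following is a minimal sketch, assuming a small frame rebuilt by hand from the five rows printed above; the names `sample`, `missing_per_column`, `all_numeric`, and `target_values` are illustrative and not part of the original notebook.

```python
import pandas as pd

# The five rows shown by plan.head(), rebuilt by hand for illustration only.
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0, 66.0],
    "minutes": [311.9, 516.75, 467.66, 745.53, 418.74],
    "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
    "is_ultra": [0, 0, 0, 1, 0],
})

# Count missing values in each column.
missing_per_column = sample.isna().sum()

# Check that every column already has a numeric dtype.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in sample.dtypes)

# Check that the target only contains the two expected classes.
target_values = set(sample["is_ultra"].tolist())

print(missing_per_column)
print("All columns numeric:", all_numeric)
print("Target values:", target_values)
```

On the full dataset the same calls would be applied to `plan` instead of `sample`.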

## Splitting the source data into a training set, a validation set, and a test set.


Next, I will split the dataset into a training set, a validation set, and a test set.
The validation set and the test set are usually the same size, so I will split the data in a 3:1:1 ratio: 60% for training, 20% for validation, and 20% for testing.


In [9]:
# Splitting the data: first 20% for the test set, then 25% of the remaining 80% (20% of the total) for validation
train_set, plan_test = train_test_split(plan, test_size=0.2, train_size=0.8, random_state=54321)
plan_train, plan_valid = train_test_split(train_set, test_size=0.25, train_size=0.75, random_state=54321)

#Declaring features and target variables 
features_train = plan_train.drop(['is_ultra'], axis=1)
target_train = plan_train['is_ultra']
features_valid = plan_valid.drop(['is_ultra'], axis=1)
target_valid = plan_valid['is_ultra']
features_test = plan_test.drop(['is_ultra'], axis=1)
target_test = plan_test['is_ultra']
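
The `test_size=0.25` in the second split can look odd next to the stated 20% validation share. It works because the second split is applied to the remaining 80% of the rows, and 0.25 × 0.8 = 0.2. Here is a minimal sketch that checks the resulting sizes, assuming a synthetic stand-in frame (`demo`) with the same number of rows as the Megaline data; the column `x` and all `demo_*` names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same row count as the real dataset (3214 rows).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "x": rng.normal(size=3214),
    "is_ultra": rng.integers(0, 2, size=3214),
})

# Step 1: hold out 20% of all rows as the test set.
rest, demo_test = train_test_split(demo, test_size=0.2, random_state=54321)

# Step 2: hold out 25% of the remaining 80% (20% of the total) for validation.
demo_train, demo_valid = train_test_split(rest, test_size=0.25, random_state=54321)

# Share of the original rows that ended up in each part.
shares = {
    "train": len(demo_train) / len(demo),
    "valid": len(demo_valid) / len(demo),
    "test": len(demo_test) / len(demo),
}
print(len(demo_train), len(demo_valid), len(demo_test))
print({name: round(share, 2) for name, share in shares.items()})
```

The three parts do not overlap and together cover every row, so no observation is used both for fitting and for evaluation.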

## Investigating the quality of different models by changing hyperparameters. 


Next, I want to investigate the quality of different models by changing their hyperparameters. This project calls for classification models, because we are dealing with two possible outcomes: 1 = is Ultra, 0 = is not Ultra. Because I want the model with the highest possible accuracy, the models I am going to examine are RandomForestClassifier and LogisticRegression.

In [10]:
# Looping over n_estimators and max_depth and keeping the combination with the best validation accuracy
best_model = None
best_result = 0.70  # a model is only recorded if it beats this starting value
best_est = 0
best_depth = 0
for est in range(1, 100):
    for depth in range(1, 30):
        model = RandomForestClassifier(random_state=54321, n_estimators=est, max_depth=depth)
        model.fit(features_train, target_train)  # train the model on the training set
        score = model.score(features_valid, target_valid)  # accuracy on the validation set
        if score > best_result:
            best_model = model
            best_result = score
            best_est = est
            best_depth = depth
print("The best score:",best_result,"Best est : ",best_est,"best_depth : ",best_depth)


The best score: 0.8460342146189735 Best est :  5 best_depth :  9




The RandomForestClassifier reaches a validation accuracy of about 0.846 with n_estimators = 5 and max_depth = 9. This is above the 0.75 threshold that was set, but I still want to check whether a different model can do better.
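
The nested loop above scores every combination on a single validation set. A related technique is scikit-learn's `GridSearchCV`, which runs the same kind of exhaustive search but scores each combination with cross-validation. This is a different procedure from the one used in this project, so the result would not necessarily match. The following is a minimal sketch, assuming synthetic data from `make_classification` and a deliberately small, made-up grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data with four features (illustration only).
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# A small, hypothetical grid; the loop above covers a much larger range.
param_grid = {"n_estimators": [5, 10, 20], "max_depth": [3, 6, 9]}

# Each combination is scored by 3-fold cross-validated accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=54321),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_cv_accuracy = search.best_score_
print("Best parameters:", best_params)
print("Best cross-validated accuracy:", best_cv_accuracy)
```

With this approach the training and validation sets would normally be combined and passed to `fit` together, because the cross-validation folds take over the role of the validation set.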

In [11]:
# Looping over max_iter and solver and keeping the combination with the best validation accuracy

best_lr_model = None  # kept separate so the best random forest in best_model is not overwritten
l_best_result = 0.70
best_solver = 'lbfgs'
best_iter = 0
for iter_it in range(1, 1000, 10):
    for s in ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']:
        lr_model = LogisticRegression(random_state=12345, solver=s, max_iter=iter_it)
        lr_model.fit(features_train, target_train)  # train the model on the training set
        lr_score = lr_model.score(features_valid, target_valid)  # accuracy on the validation set
        if lr_score > l_best_result:
            l_best_result = lr_score
            best_lr_model = lr_model
            best_solver = s
            best_iter = iter_it

print("The best score:",l_best_result,"Best solver : ",best_solver," best itter : ",best_iter)


The best score: 0.7791601866251944 Best solver :  newton-cg  best itter :  31


As we can see, LogisticRegression did not improve on that result: its best validation accuracy is about 0.78, so I will use the previous model, the RandomForestClassifier.
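
One thing worth keeping in mind about this comparison is that `max_iter` only caps the number of solver iterations, and the `sag` and `saga` solvers are documented to converge quickly only when the features are on a similar scale. In this dataset `mb_used` is in the thousands while `calls` is in the tens. The following is a minimal sketch of scaling inside a `Pipeline`, assuming synthetic data with one artificially inflated column; it was not run on the Megaline data, so it says nothing about whether scaling would change the result above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; the last column is inflated to imitate a feature like mb_used.
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X = X * np.array([1.0, 10.0, 1.0, 1000.0])

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=54321
)

# The scaler is fitted on the training data only, then reused for validation.
scaled_lr = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(random_state=12345, solver="lbfgs", max_iter=1000)),
])
scaled_lr.fit(X_train, y_train)

valid_accuracy = scaled_lr.score(X_valid, y_valid)
print("Validation accuracy with scaled features:", valid_accuracy)
```

Putting the scaler inside the pipeline keeps the validation rows out of the scaling statistics, which is the main reason to prefer it over scaling the whole frame up front.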

## Sanity checking the model


Next, I want to sanity check the model. To do that, I will create a dummy model that always predicts the same value. In this case I will take the most common value in the target column and create a 1D pandas Series that holds that value for every row of the test set. I want to check that the model I trained is more accurate than this dummy model.

In [12]:
# Counting the class frequencies in the test target to find the most common value
plan_test.is_ultra.value_counts()

0    427
1    216
Name: is_ultra, dtype: int64

In [13]:
#creating a series with that value.
cooked_pred = pd.Series(0, index=plan_test.index)

In [14]:
# Calculating the accuracy of the dummy model on the test set
accuracy_sanity = accuracy_score(target_test, cooked_pred)
accuracy_sanity

0.6640746500777605

As we can see, the sanity check went well. The dummy model scores about 0.66 on the test set, while the model I trained reached about 0.846 on the validation set, well above that baseline.
Now I can test the final model.
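
The hand-built constant Series above has a ready-made counterpart in scikit-learn: `DummyClassifier(strategy="most_frequent")`. One difference is that it learns the majority class from the training target rather than from the test target. Below is a minimal sketch, assuming a test target rebuilt from the class counts reported above (427 zeros and 216 ones) and a small hypothetical training target whose majority class is also 0.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Test target rebuilt from the value counts above: 427 Smart (0), 216 Ultra (1).
y_test = np.array([0] * 427 + [1] * 216)

# Hypothetical training target; only its majority class (0) matters here.
y_train = np.array([0] * 7 + [1] * 3)

# DummyClassifier ignores the features, so zero-filled placeholders are enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

# Accuracy of always predicting the majority class: 427 / 643.
baseline_accuracy = dummy.score(X_test, y_test)
print("Baseline accuracy:", baseline_accuracy)
```

The baseline equals the share of the majority class in the test set, which is why it reproduces the accuracy of the constant Series.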

## Check the accuracy using the test dataset.


Next, I will test the final model on the test dataset and see if the training was successful.

In [15]:
# Testing the final model with the test dataset
final_model = RandomForestClassifier(random_state=54321, n_estimators=best_est, max_depth=best_depth)
final_model.fit(features_train, target_train)  # train the model on the training set
final_score = final_model.score(features_test, target_test)  # calculate the accuracy score on the test set
final_score

0.7791601866251944

We got a score of about 0.78, which meets the 0.75 threshold, so the model is successful by that criterion. It also beats the dummy model's 0.66 on the same test set.