# Megaline Machine Learning
Project Report by Allentine Paulis

# Table of Contents
* [Project Description](#description)
* [Data](#data)
* [Step 1. Understanding Data](#understanding)
* [Step 2. Splitting Data](#split)   
* [Step 3. Machine Learning model](#model)
* [Step 4. Test Set Model](#test)
* [Step 5. Overall conclusion](#allconclusion)

# Project Description <a class="anchor" id="description"></a>
Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

We have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, we need to develop a model that will pick the right plan. Since we have already performed the data preprocessing step, we can move straight to creating the model.

The goal is to develop a model with the highest possible accuracy. The minimum acceptable accuracy for this project is 0.75, and it is checked on the test dataset.

# Data <a class="anchor" id="data"></a>

Every observation in the dataset contains monthly behavior information about one user. The information given is as follows:
- `calls` — number of calls,
- `minutes` — total call duration in minutes,
- `messages` — number of text messages,
- `mb_used` — Internet traffic used in MB,
- `is_ultra` — plan for the current month (Ultra - 1, Smart - 0).

# Step 1. Understanding Data  <a class="anchor" id="understanding"></a>

In [2]:
import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
df = pd.read_csv('https://code.s3.yandex.net/datasets/users_behavior.csv')

In [4]:
df.describe()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |
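
Because `is_ultra` is coded as 0/1, its mean in the `describe()` output is also the share of Ultra rows. That gives a quick reference point for the 0.75 accuracy threshold. The sketch below only redoes that arithmetic from the printed summary values; the variable names are illustrative and not part of the original notebook.

```python
# Values copied from the describe() output above.
n_rows = 3214
mean_is_ultra = 0.306472  # mean of a 0/1 column = share of rows equal to 1

ultra_share = mean_is_ultra
smart_share = 1 - ultra_share

# Approximate class counts implied by the shares.
ultra_count = round(ultra_share * n_rows)
smart_count = n_rows - ultra_count

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(ultra_share, smart_share)

print(f"Ultra rows: {ultra_count}, Smart rows: {smart_count}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

On the full dataset, a model that always predicts Smart would be right about 69% of the time, so the 0.75 threshold sits only a few points above this baseline. The exact baseline on a held-out subset may differ slightly.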


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [7]:
df.sample(10)

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 1092 | 80.0 | 555.85 | 140.0 | 24412.57 | 1 |
| 391 | 40.0 | 238.62 | 16.0 | 24437.02 | 0 |
| 450 | 7.0 | 48.39 | 16.0 | 4016.94 | 1 |
| 2487 | 67.0 | 439.01 | 20.0 | 24095.57 | 0 |
| 2168 | 66.0 | 443.44 | 43.0 | 21871.05 | 0 |
| 1279 | 22.0 | 128.51 | 64.0 | 19271.51 | 0 |
| 3193 | 47.0 | 232.6 | 72.0 | 20994.96 | 0 |
| 1999 | 56.0 | 398.45 | 4.0 | 23682.94 | 0 |
| 244 | 156.0 | 1058.22 | 59.0 | 18932.66 | 1 |
| 274 | 60.0 | 442.36 | 0.0 | 16446.23 | 0 |


# Step 2. Splitting Data  <a class="anchor" id="split"></a>
Split the source data into a training set, a validation set, and a test set.
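
One common way to get three sets is to call `train_test_split` twice: first to set aside the test set, then to divide the remainder into training and validation sets. The sketch below is a minimal illustration under several assumptions that the text above does not state: a 60/20/20 ratio, stratification on `is_ultra`, a seed of `12345`, and a hypothetical helper name `split_three_way`. It runs on a small synthetic frame with the same column names, not on the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_three_way(df, target="is_ultra", valid_size=0.2, test_size=0.2, seed=12345):
    """Hypothetical helper: split df into train/validation/test feature-label pairs."""
    features = df.drop(columns=[target])
    labels = df[target]

    # Step 1: set aside the test set (assumed 20% of all rows).
    x_rest, x_test, y_rest, y_test = train_test_split(
        features, labels, test_size=test_size, random_state=seed, stratify=labels
    )

    # Step 2: split the remainder; the validation fraction is rescaled
    # so that it is still valid_size of the ORIGINAL row count.
    relative_valid = valid_size / (1 - test_size)
    x_train, x_valid, y_train, y_valid = train_test_split(
        x_rest, y_rest, test_size=relative_valid, random_state=seed, stratify=y_rest
    )
    return (x_train, y_train), (x_valid, y_valid), (x_test, y_test)


# Synthetic stand-in for the real data (values are made up; 30% Ultra rows).
demo = pd.DataFrame({
    "calls": [float(i) for i in range(100)],
    "minutes": [i * 7.0 for i in range(100)],
    "messages": [float(i % 30) for i in range(100)],
    "mb_used": [i * 150.0 for i in range(100)],
    "is_ultra": [1 if i % 10 < 3 else 0 for i in range(100)],
})

train_set, valid_set, test_set = split_three_way(demo)
for name, (x, y) in [("train", train_set), ("valid", valid_set), ("test", test_set)]:
    print(name, x.shape, f"ultra share = {y.mean():.2f}")
```

With the real data, the same call would be `split_three_way(df)`. Stratifying keeps the Ultra share roughly equal across the three sets, which matters here because only about 31% of rows are Ultra.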