# Telecom Megaline Statistical Data Analysis
Project Report by Allentine Paulis

# Table of Contents
* [Project Description](#description)
    * [Description of the plans](#plandescription)
    * [Surf](#surfdesc)
    * [Ultimate](#ultimatedesc)
* [Data](#data)
* [Step 1. Understanding Data](#understanding)
    * [Study the general information : Megaline Calls](#studycalls)
    * [Study the general information : Megaline Internet](#studyinternet)
    * [Study the general information : Megaline Messages](#studymessages)
    * [Study the general information : Megaline Plans](#studyplans)
    * [Study the general information : Megaline Users](#studyusers)
* [Step 2. Data Preprocessing](#preprocessing)
* [Step 3. Carry out exploratory data analysis](#eda)
    * [What factors impact price the most? - Based on overall correlation](#factoroverall)
        * [Heatmap Correlation](#heatmap)
        * [Correlation after outliers removal](#outlierscorr)
        * [Comparison Correlation](#comparecorr)
        * [Correlation with Dummies](#corrdum)
* [Step 4. Hypotheses Testing](#hypotest)        
* [Step 5. Overall conclusion](#allconclusion)

# Project Description <a class="anchor" id="description"></a>

We work as analysts for the telecom operator Megaline. The company offers its clients two prepaid plans, Surf and Ultimate. The commercial department wants to know which of the plans brings in more revenue in order to adjust the advertising budget.

We are going to carry out a preliminary analysis of the plans based on a relatively small client selection. We'll have the data on 500 Megaline clients: who the clients are, where they're from, which plan they use, and the number of calls they made and text messages they sent in 2018. The task is to analyze clients' behavior and determine which prepaid plan brings in more revenue.

### Description of the plans <a class="anchor" id="plandescription"></a>

Note: Megaline rounds seconds up to minutes, and megabytes to gigabytes. For **calls**, each individual call is rounded up: even if the call lasted just one second, it will be counted as one minute. For **web traffic**, individual web sessions are not rounded up. Instead, the total for the month is rounded up. If someone uses 1025 megabytes this month, they will be charged for 2 gigabytes.
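The rounding rule can be sketched in code. This is a minimal illustration, not part of the original notebook: the helper names `billed_minutes` and `billed_gb` and the `sample` frame are hypothetical, and the 1024 MB per GB factor is taken from the plan description.

```python
import math

import numpy as np
import pandas as pd

MB_PER_GB = 1024  # 1 GB = 1024 MB, as stated in the data description


def billed_minutes(duration_min: float) -> int:
    """Round a single call up to whole minutes (hypothetical helper)."""
    return math.ceil(duration_min)


def billed_gb(total_mb_in_month: float) -> int:
    """Round the monthly total, not each session, up to whole gigabytes."""
    return math.ceil(total_mb_in_month / MB_PER_GB)


print(billed_minutes(1 / 60))  # a one-second call
print(billed_gb(1025))         # 1025 MB used in a month

# The same rule applied to a made-up table of calls (hypothetical data)
sample = pd.DataFrame({
    "user_id": [1000, 1000, 1001],
    "month": [12, 12, 8],
    "duration": [8.52, 0.02, 13.66],
})
# Round each call up first, then sum per user and month
sample["billed_min"] = np.ceil(sample["duration"]).astype(int)
monthly_minutes = sample.groupby(["user_id", "month"])["billed_min"].sum()
print(monthly_minutes)
```

Note the order of operations: calls are rounded before summing, while web traffic is summed before rounding.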

### Surf <a class="anchor" id="surfdesc"></a>

1. Monthly charge: $20

2. 500 monthly minutes, 50 texts, and 15 GB of data

3. After exceeding the package limits:
    * 1 minute: 3 cents
    * 1 text message: 3 cents
    * 1 GB of data: $10

### Ultimate <a class="anchor" id="ultimatedesc"></a>

1. Monthly charge: $70

2. 3000 monthly minutes, 1000 text messages, and 30 GB of data

3. After exceeding the package limits:
    * 1 minute: 1 cent
    * 1 text message: 1 cent
    * 1 GB of data: $7

# Data <a class="anchor" id="data"></a>

The `users` table (data on users):
- *user_id* — unique user identifier
- *first_name* — user's name
- *last_name* — user's last name
- *age* — user's age (years)
- *reg_date* — subscription date (dd, mm, yy)
- *churn_date* — the date the user stopped using the service (if the value is missing, the calling plan was being used when this database was extracted)
- *city* — user's city of residence
- *plan* — calling plan name


The `calls` table (data on calls):
- *id* — unique call identifier
- *call_date* — call date
- *duration* — call duration (in minutes)
- *user_id* — the identifier of the user making the call


The `messages` table (data on texts):
- *id* — unique text message identifier
- *message_date* — text message date
- *user_id* — the identifier of the user sending the text


The `internet` table (data on web sessions):
- *id* — unique session identifier
- *mb_used* — the volume of data spent during the session (in megabytes)
- *session_date* — web session date
- *user_id* — user identifier


The `plans` table (data on the plans):
- *plan_name* — calling plan name
- *usd_monthly_pay* — monthly charge in US dollars
- *minutes_included* — monthly minute allowance
- *messages_included* — monthly text allowance
- *mb_per_month_included* — data volume allowance (in megabytes)
- *usd_per_minute* — price per minute after exceeding the package limits (e.g., if the package includes 100 minutes, the 101st minute will be charged)
- *usd_per_message* — price per text after exceeding the package limits
- *usd_per_gb* — price per extra gigabyte of data after exceeding the package limits (1 GB = 1024 megabytes)

## Step 1. Understanding Data  <a class="anchor" id="understanding"></a>

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import math as mt
from scipy import stats as st
import seaborn as sns

In [2]:
calls = pd.read_csv('https://code.s3.yandex.net/datasets/megaline_calls.csv')
internet = pd.read_csv('https://code.s3.yandex.net/datasets/megaline_internet.csv')
messages = pd.read_csv('https://code.s3.yandex.net/datasets/megaline_messages.csv')
plans = pd.read_csv('https://code.s3.yandex.net/datasets/megaline_plans.csv')
users = pd.read_csv('https://code.s3.yandex.net/datasets/megaline_users.csv')

### Study the general information : Megaline Calls <a class="anchor" id="studycalls"> </a>

In [3]:
calls.head()

Unnamed: 0,id,user_id,call_date,duration
0,1000_93,1000,2018-12-27,8.52
1,1000_145,1000,2018-12-27,13.66
2,1000_247,1000,2018-12-27,14.48
3,1000_309,1000,2018-12-28,5.76
4,1000_380,1000,2018-12-30,4.22


In [4]:
calls.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137735 entries, 0 to 137734
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   id         137735 non-null  object 
 1   user_id    137735 non-null  int64  
 2   call_date  137735 non-null  object 
 3   duration   137735 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 4.2+ MB


* `call_date` is stored as `object`; it should be converted to datetime.

In [5]:
calls.describe(include='all')

Unnamed: 0,id,user_id,call_date,duration
count,137735,137735.0,137735,137735.0
unique,137735,,351,
top,1302_145,,2018-12-27,
freq,1,,1091,
mean,,1247.658046,,6.745927
std,,139.416268,,5.839241
min,,1000.0,,0.0
25%,,1128.0,,1.29
50%,,1247.0,,5.98
75%,,1365.0,,10.69


- The minimum call duration is 0. Such calls may be missed calls.
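Before deciding how to treat zero-duration calls, it helps to know how common they are. The sketch below assumes a small made-up `calls_sample` frame standing in for `calls`; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the `calls` table
calls_sample = pd.DataFrame({
    "id": ["1000_1", "1000_2", "1001_1", "1001_2"],
    "user_id": [1000, 1000, 1001, 1001],
    "duration": [8.52, 0.0, 0.0, 5.76],
})

zero_mask = calls_sample["duration"] == 0

# Share of calls with zero duration, in percent
zero_share = zero_mask.mean() * 100

# Number of zero-duration calls per user
zero_per_user = calls_sample.loc[zero_mask].groupby("user_id")["id"].count()

print(f"{zero_share:.1f}% of calls have zero duration")
print(zero_per_user)
```

If the share is large and spread evenly across users, treating these rows as missed calls is more plausible than treating them as data errors.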

In [6]:
calls.isna().sum()

id           0
user_id      0
call_date    0
duration     0
dtype: int64

In [7]:
calls.duplicated().sum()

0

In [8]:
calls['id'].nunique() == len(calls)

True

- Every call `id` is unique.

In [9]:
calls['user_id'].nunique()

481

- There are 481 unique users in calls

### Study the general information : Megaline Internet <a class="anchor" id="studyinternet"> </a>

In [10]:
internet.head()

Unnamed: 0,id,user_id,session_date,mb_used
0,1000_13,1000,2018-12-29,89.86
1,1000_204,1000,2018-12-31,0.0
2,1000_379,1000,2018-12-28,660.4
3,1000_413,1000,2018-12-26,270.99
4,1000_442,1000,2018-12-27,880.22


In [11]:
internet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104825 entries, 0 to 104824
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            104825 non-null  object 
 1   user_id       104825 non-null  int64  
 2   session_date  104825 non-null  object 
 3   mb_used       104825 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 3.2+ MB


- `session_date` is stored as `object`; it should be converted to datetime.

In [12]:
internet.describe(include='all')

Unnamed: 0,id,user_id,session_date,mb_used
count,104825,104825.0,104825,104825.0
unique,104825,,351,
top,1302_145,,2018-12-24,
freq,1,,851,
mean,,1242.496361,,366.713701
std,,142.053913,,277.170542
min,,1000.0,,0.0
25%,,1122.0,,136.08
50%,,1236.0,,343.98
75%,,1367.0,,554.61


- Some sessions have 0 `mb_used`. The users may have been on WiFi rather than the mobile data connection.

In [13]:
internet.isna().sum()

id              0
user_id         0
session_date    0
mb_used         0
dtype: int64

In [14]:
internet.duplicated().sum()

0

In [15]:
internet['id'].nunique() == len(internet)

True

- Internet id is unique

In [16]:
internet['user_id'].nunique()

489

- There are 489 unique users in the internet table.

### Study the general information : Megaline Messages <a class="anchor" id="studymessages"> </a>

In [17]:
messages.head()

Unnamed: 0,id,user_id,message_date
0,1000_125,1000,2018-12-27
1,1000_160,1000,2018-12-31
2,1000_223,1000,2018-12-31
3,1000_251,1000,2018-12-27
4,1000_255,1000,2018-12-26


In [18]:
messages.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76051 entries, 0 to 76050
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            76051 non-null  object
 1   user_id       76051 non-null  int64 
 2   message_date  76051 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.7+ MB


- `message_date` is stored as `object`; it should be converted to datetime.

In [19]:
messages.describe(include='all')

Unnamed: 0,id,user_id,message_date
count,76051,76051.0,76051
unique,76051,,351
top,1302_145,,2018-12-28
freq,1,,702
mean,,1245.972768,
std,,139.843635,
min,,1000.0,
25%,,1123.0,
50%,,1251.0,
75%,,1362.0,


In [20]:
messages.isna().sum()

id              0
user_id         0
message_date    0
dtype: int64

In [21]:
messages.duplicated().sum()

0

In [22]:
messages['id'].nunique() == len(messages)

True

- Every message `id` is unique.

In [23]:
messages['user_id'].nunique()

402

- There are 402 unique users in the messages table.

### Study the general information : Megaline Plans <a class="anchor" id="studyplans"> </a>

In [24]:
plans.head()

Unnamed: 0,messages_included,mb_per_month_included,minutes_included,usd_monthly_pay,usd_per_gb,usd_per_message,usd_per_minute,plan_name
0,50,15360,500,20,10,0.03,0.03,surf
1,1000,30720,3000,70,7,0.01,0.01,ultimate


In [25]:
plans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   messages_included      2 non-null      int64  
 1   mb_per_month_included  2 non-null      int64  
 2   minutes_included       2 non-null      int64  
 3   usd_monthly_pay        2 non-null      int64  
 4   usd_per_gb             2 non-null      int64  
 5   usd_per_message        2 non-null      float64
 6   usd_per_minute         2 non-null      float64
 7   plan_name              2 non-null      object 
dtypes: float64(2), int64(5), object(1)
memory usage: 256.0+ bytes


### Study the general information : Megaline Users <a class="anchor" id="studyusers"> </a>

In [26]:
users.head()

Unnamed: 0,user_id,first_name,last_name,age,city,reg_date,plan,churn_date
0,1000,Anamaria,Bauer,45,"Atlanta-Sandy Springs-Roswell, GA MSA",2018-12-24,ultimate,
1,1001,Mickey,Wilkerson,28,"Seattle-Tacoma-Bellevue, WA MSA",2018-08-13,surf,
2,1002,Carlee,Hoffman,36,"Las Vegas-Henderson-Paradise, NV MSA",2018-10-21,surf,
3,1003,Reynaldo,Jenkins,52,"Tulsa, OK MSA",2018-01-28,surf,
4,1004,Leonila,Thompson,40,"Seattle-Tacoma-Bellevue, WA MSA",2018-05-23,surf,


In [27]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     500 non-null    int64 
 1   first_name  500 non-null    object
 2   last_name   500 non-null    object
 3   age         500 non-null    int64 
 4   city        500 non-null    object
 5   reg_date    500 non-null    object
 6   plan        500 non-null    object
 7   churn_date  34 non-null     object
dtypes: int64(2), object(6)
memory usage: 31.4+ KB


- `reg_date` is stored as `object`; it should be converted to datetime.

In [28]:
users.isna().sum()

user_id         0
first_name      0
last_name       0
age             0
city            0
reg_date        0
plan            0
churn_date    466
dtype: int64

In [29]:
users.duplicated().sum()

0

In [30]:
users['churn_date'].isna().sum()/ len(users) * 100

93.2

In [31]:
(len(users) - users['churn_date'].isna().sum())/len(users) * 100

6.800000000000001

- `churn_date` is stored as `object`; it should be converted to datetime.
- 93.2% of users have no `churn_date` and are still active, so 6.8% of customers (34 of 500) churned.
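The two percentages can also be computed directly from boolean masks, since the mean of a boolean Series is the fraction of `True` values. A minimal sketch, assuming a made-up `users_sample` frame in place of `users`:

```python
import pandas as pd

# Hypothetical stand-in for the `users` table
users_sample = pd.DataFrame({
    "user_id": [1000, 1001, 1002, 1003],
    "churn_date": [None, "2018-12-18", None, None],
})

# A non-null churn_date means the user has left the service
churned_pct = users_sample["churn_date"].notna().mean() * 100
active_pct = users_sample["churn_date"].isna().mean() * 100

print(f"churned: {churned_pct:.1f}%, active: {active_pct:.1f}%")
```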

In [32]:
users['user_id'].nunique() == len(users)

True

- `user_id` is unique; there are 500 distinct users.

### Conclusion

- `call_date` has the wrong data type and should be converted to datetime.
- The minimum call duration is 0; calls with a duration below 0.1 could be classified as missed calls.
- There are 481 unique users in calls.


- `session_date` has the wrong data type and should be converted to datetime.
- Some sessions have 0 `mb_used`. This is plausible: the users may have been on WiFi rather than the mobile data connection.
- There are 489 unique users in the internet table.


- `message_date` has the wrong data type and should be converted to datetime.
- There are 402 unique users in the messages table.


- `reg_date` has the wrong data type and should be converted to datetime.
- `churn_date` has the wrong data type and should be converted to datetime.
- 6.8% of customers churned. The remaining 93.2% have a null `churn_date` and are still active.
- `user_id` is unique and there are 500 users in total, but not all of them use every service.
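The gap between the 500 registered users and the 481, 489, and 402 users seen in the service tables can be inspected with set differences. The sketch below uses tiny made-up frames (`users_sample`, `calls_sample`, and so on) as stand-ins for the real tables.

```python
import pandas as pd

# Hypothetical stand-ins for the real tables
users_sample = pd.DataFrame({"user_id": [1000, 1001, 1002, 1003]})
calls_sample = pd.DataFrame({"user_id": [1000, 1000, 1001]})
internet_sample = pd.DataFrame({"user_id": [1000, 1002]})
messages_sample = pd.DataFrame({"user_id": [1001]})

all_users = set(users_sample["user_id"])
active = {
    "calls": set(calls_sample["user_id"]),
    "internet": set(internet_sample["user_id"]),
    "messages": set(messages_sample["user_id"]),
}

# Users missing from each individual service table
missing = {service: all_users - ids for service, ids in active.items()}
for service, ids in missing.items():
    print(f"{service}: {len(ids)} users with no records")

# Users who appear in none of the service tables
no_activity = sorted(all_users - set().union(*active.values()))
print("No activity at all:", no_activity)
```

Users with no records in a table still pay the monthly charge, so they should get zeros rather than be dropped when the tables are merged later.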


## Step 2. Data Preprocessing  <a class="anchor" id="preprocessing"></a>

- Convert the data to the necessary types
- Find and eliminate errors in the data
- Explain what errors you found and how you removed them.


For each user, find:
- The number of calls made and minutes used per month
- The number of text messages sent per month
- The volume of data per month
- The monthly revenue from each user (subtract the free package limit from the total number of calls, text messages, and data; multiply the result by the calling plan value; add the monthly charge depending on the calling plan)
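The revenue rule in the last bullet can be sketched as a small function. This is an illustration under stated assumptions, not the notebook's implementation: the `PLANS` dictionary and the `monthly_revenue` helper are hypothetical names, the plan figures are copied from the plan descriptions, minutes are assumed to be already rounded up per call, and usage below the allowance is assumed to earn no credit (overage is clipped at zero).

```python
import math

MB_PER_GB = 1024

# Plan parameters copied from the plan descriptions (hypothetical structure)
PLANS = {
    "surf": {"fee": 20, "minutes": 500, "messages": 50, "gb": 15,
             "per_minute": 0.03, "per_message": 0.03, "per_gb": 10},
    "ultimate": {"fee": 70, "minutes": 3000, "messages": 1000, "gb": 30,
                 "per_minute": 0.01, "per_message": 0.01, "per_gb": 7},
}


def monthly_revenue(plan: str, minutes: int, messages: int, mb_used: float) -> float:
    """Monthly charge plus overage for one user in one month."""
    p = PLANS[plan]
    # Overage cannot be negative: unused allowance is not refunded (assumption)
    extra_minutes = max(0, minutes - p["minutes"])
    extra_messages = max(0, messages - p["messages"])
    # The monthly data total is rounded up to whole gigabytes before comparing
    extra_gb = max(0, math.ceil(mb_used / MB_PER_GB) - p["gb"])
    total = (p["fee"]
             + extra_minutes * p["per_minute"]
             + extra_messages * p["per_message"]
             + extra_gb * p["per_gb"])
    return round(total, 2)


# 20 extra minutes, 10 extra texts, and 1 extra GB on Surf
surf_bill = monthly_revenue("surf", minutes=520, messages=60, mb_used=16000)
# Usage well inside the Ultimate allowance
ultimate_bill = monthly_revenue("ultimate", minutes=100, messages=10, mb_used=5000)
print(surf_bill, ultimate_bill)
```

In the Surf example, 16000 MB rounds up to 16 GB, one more than the 15 GB allowance, so the bill is 20 + 0.60 + 0.30 + 10.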

Fix datetime data types

In [34]:
calls['call_date'] = pd.to_datetime(calls['call_date'])
internet['session_date'] = pd.to_datetime(internet['session_date'])
messages['message_date'] = pd.to_datetime(messages['message_date'])
users['reg_date'] = pd.to_datetime(users['reg_date'])
users['churn_date'] = pd.to_datetime(users['churn_date'])

In [37]:
calls.head()

Unnamed: 0,id,user_id,call_date,duration
0,1000_93,1000,2018-12-27,8.52
1,1000_145,1000,2018-12-27,13.66
2,1000_247,1000,2018-12-27,14.48
3,1000_309,1000,2018-12-28,5.76
4,1000_380,1000,2018-12-30,4.22


In [45]:
calls['day'] = calls['call_date'].dt.day
calls['month'] = calls['call_date'].dt.month
calls['year'] = calls['call_date'].dt.year

In [46]:
calls.head()

Unnamed: 0,id,user_id,call_date,duration,day,month,year
0,1000_93,1000,2018-12-27,8.52,27,12,2018
1,1000_145,1000,2018-12-27,13.66,27,12,2018
2,1000_247,1000,2018-12-27,14.48,27,12,2018
3,1000_309,1000,2018-12-28,5.76,28,12,2018
4,1000_380,1000,2018-12-30,4.22,30,12,2018


In [52]:
calls.dtypes

id                   object
user_id               int64
call_date    datetime64[ns]
duration            float64
day                   int64
month                 int64
year                  int64
dtype: object

In [53]:
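# Cast the call id to an integer; the underscore is dropped, so '1000_93' becomes 100093 (see the output below)
calls['id'] = calls['id'].astype('int64')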
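The cast changes the id values because Python's `int()` accepts single underscores between digits, so the separator simply disappears. The sketch below shows that behavior and an alternative that keeps both parts of the id; the `ids` Series and the `call_no` column name are made up for illustration.

```python
import pandas as pd

# Python's int() treats a single underscore between digits as a separator
converted = int("1000_93")
print(converted)

# Alternative (an assumption, not the notebook's approach):
# split the id into its two parts so no information is merged
ids = pd.Series(["1000_93", "1000_145", "1001_0"])
parts = ids.str.split("_", expand=True).astype("int64")
parts.columns = ["user_id", "call_no"]
print(parts)
```

Since `id` is only counted in the later aggregation, leaving it as a string would work equally well.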

In [54]:
calls.head()

Unnamed: 0,id,user_id,call_date,duration,day,month,year
0,100093,1000,2018-12-27,8.52,27,12,2018
1,1000145,1000,2018-12-27,13.66,27,12,2018
2,1000247,1000,2018-12-27,14.48,27,12,2018
3,1000309,1000,2018-12-28,5.76,28,12,2018
4,1000380,1000,2018-12-30,4.22,30,12,2018


- The number of calls made and minutes used per month

In [50]:
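# Total minutes and number of calls per user and month.
# Note: durations are summed as recorded; per-call rounding up is not applied here.
calls.groupby(['user_id', 'month']).agg(
    {'duration': 'sum', 'id': 'count'}
).rename(columns={'duration': 'Minutes Used per month', 'id': 'Calls Made'})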

Unnamed: 0_level_0,Unnamed: 1_level_0,Minutes Used per month,Calls Made
user_id,month,Unnamed: 2_level_1,Unnamed: 3_level_1
1000,12,116.83,16
1001,8,171.14,27
1001,9,297.69,49
1001,10,374.11,65
1001,11,404.59,64
...,...,...,...
1498,12,324.77,39
1499,9,330.37,41
1499,10,363.28,53
1499,11,288.56,45
