## Data Description

The data consists of files obtained from different sources:

   - `contract.csv` — contract information
   - `personal.csv` — the client's personal data
   - `internet.csv` — information about Internet services
   - `phone.csv` — information about telephone services
    
In each file, the column `customerID` contains a unique code assigned to each client.
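The uniqueness of `customerID` is what makes it usable as a join key, so it may be worth checking before combining the files. The following is a minimal sketch on invented toy tables; the helper `has_unique_ids` and the sample IDs are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy tables standing in for contract.csv and phone.csv.
contract = pd.DataFrame({"customerID": ["0001-AAAA", "0002-BBBB", "0003-CCCC"]})
phone = pd.DataFrame({"customerID": ["0001-AAAA", "0003-CCCC"]})


def has_unique_ids(df, key="customerID"):
    """Return True when no value of `key` is missing or repeated."""
    return bool(df[key].notna().all() and df[key].is_unique)


# One flag per table: True means the key column can be used for a 1:1 join.
ids_ok = {name: has_unique_ids(df) for name, df in [("contract", contract), ("phone", phone)]}
print(ids_ok)
```

The same helper could be applied to `df_contract`, `df_internet`, `df_personal` and `df_phone` once they are loaded.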

In [12]:
from functools import reduce  # used later to merge the tables

import pandas as pd

In [14]:
try:
    df_contract = pd.read_csv('contract.csv')
    df_internet = pd.read_csv('internet.csv')
    df_personal = pd.read_csv('personal.csv')
    df_phone = pd.read_csv('phone.csv')
except FileNotFoundError:  # fall back to the hosted dataset directory
    df_contract = pd.read_csv('/datasets/final_provider/contract.csv')
    df_internet = pd.read_csv('/datasets/final_provider/internet.csv')
    df_personal = pd.read_csv('/datasets/final_provider/personal.csv')
    df_phone = pd.read_csv('/datasets/final_provider/phone.csv')

In [15]:
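def inspect_dataframe(df, n_rows=3):
    """Print the column summary and show a random sample of `n_rows` rows."""
    # df.info() prints its report and returns None, so print() also emits "None".
    print(df.info())
    display(df.sample(n_rows))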

In [16]:
dataframes = [
    ("Contract", df_contract),
    ("Internet", df_internet),
    ("Personal", df_personal),
    ("Phone", df_phone)
]

for df_name, df in dataframes:
    print(f"{'-' * 30}\n{df_name}\n{'-' * 30}")
    inspect_dataframe(df)
    print("\n")

------------------------------
Contract
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
dtypes: float64(1), object(7)
memory usage: 440.3+ KB
None


Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
562,3701-SFMUH,2019-07-01,No,Month-to-month,No,Credit card (automatic),69.7,516.15
4025,2984-MIIZL,2019-07-01,2019-11-01 00:00:00,Month-to-month,Yes,Bank transfer (automatic),74.8,321.9
4785,8854-CCVSQ,2018-04-01,2019-10-01 00:00:00,Month-to-month,Yes,Electronic check,80.65,1451.9




------------------------------
Internet
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5517 entries, 0 to 5516
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   customerID        5517 non-null   object
 1   InternetService   5517 non-null   object
 2   OnlineSecurity    5517 non-null   object
 3   OnlineBackup      5517 non-null   object
 4   DeviceProtection  5517 non-null   object
 5   TechSupport       5517 non-null   object
 6   StreamingTV       5517 non-null   object
 7   StreamingMovies   5517 non-null   object
dtypes: object(8)
memory usage: 344.9+ KB
None


Unnamed: 0,customerID,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies
5116,6661-HBGWL,Fiber optic,No,Yes,Yes,No,Yes,Yes
5298,1596-OQSPS,DSL,No,Yes,No,No,No,No
587,1173-NOEYG,Fiber optic,No,No,Yes,No,No,Yes




------------------------------
Personal
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customerID     7043 non-null   object
 1   gender         7043 non-null   object
 2   SeniorCitizen  7043 non-null   int64 
 3   Partner        7043 non-null   object
 4   Dependents     7043 non-null   object
dtypes: int64(1), object(4)
memory usage: 275.2+ KB
None


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents
6681,1389-CXMLU,Male,1,No,No
2004,8565-CLBZW,Male,0,No,No
3859,1732-FEKLD,Female,0,No,No




------------------------------
Phone
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6361 entries, 0 to 6360
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customerID     6361 non-null   object
 1   MultipleLines  6361 non-null   object
dtypes: object(2)
memory usage: 99.5+ KB
None


Unnamed: 0,customerID,MultipleLines
4019,3486-HOOGQ,Yes
5686,0489-WMEMG,No
4874,8313-KTIHG,No






- None of the four files contains empty or null values.
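The non-null counts reported by `info()` only cover true missing values. A column stored as `object`, such as `TotalCharges`, could in principle still hold blank strings that `info()` counts as present. The sketch below shows one way to check for that on a toy frame; the blank entry is invented for illustration and is not a claim about the actual files.

```python
import pandas as pd

# Hypothetical toy frame; the blank TotalCharges entry is invented.
toy = pd.DataFrame({
    "customerID": ["0001-AAAA", "0002-BBBB", "0003-CCCC"],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "TotalCharges": ["29.85", " ", "108.15"],
})

# True nulls per column, the same quantity info() reports indirectly.
null_counts = toy.isna().sum()

# Values that cannot be parsed as numbers become NaN with errors="coerce".
total_numeric = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_unparseable = int(total_numeric.isna().sum())

print(null_counts.sum(), n_unparseable)
```

If `n_unparseable` is greater than zero on the real data, those rows would need a decision (drop, or fill with a sensible value) before the column is converted to a numeric type.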

In [17]:
# Build a list with all the dataframes
dataframes_list = [df_contract, df_internet, df_personal, df_phone]

# Every table has 'customerID', so an outer join on that column combines all of them
merged_data = reduce(lambda left, right: pd.merge(left, right, on=['customerID'], how='outer'), dataframes_list)
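To illustrate what the chained outer join does, here is a small sketch on invented tables. The `validate="one_to_one"` argument is an addition that the notebook does not use; it asks pandas to raise an error if `customerID` is repeated on either side of a merge.

```python
from functools import reduce

import pandas as pd

# Hypothetical toy tables: customer A has no internet, customer B has no phone.
contract = pd.DataFrame({"customerID": ["A", "B", "C"], "MonthlyCharges": [20.0, 70.0, 90.0]})
internet = pd.DataFrame({"customerID": ["B", "C"], "InternetService": ["DSL", "Fiber optic"]})
phone = pd.DataFrame({"customerID": ["A", "C"], "MultipleLines": ["No", "Yes"]})

# Fold the list pairwise: ((contract + internet) + phone), keeping every customer.
merged = reduce(
    lambda left, right: pd.merge(
        left, right, on="customerID", how="outer", validate="one_to_one"
    ),
    [contract, internet, phone],
)

print(merged)
```

Customers that are absent from a table keep their row, and the columns coming from that table are left as missing values.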

In [18]:
inspect_dataframe(merged_data)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
 8   InternetService   5517 non-null   object 
 9   OnlineSecurity    5517 non-null   object 
 10  OnlineBackup      5517 non-null   object 
 11  DeviceProtection  5517 non-null   object 
 12  TechSupport       5517 non-null   object 
 13  StreamingTV       5517 non-null   object 
 14  StreamingMovies   5517 non-null   object 
 15  gender            7043 non-null   object 
 16  SeniorCitizen     7043 non-null   int64  


Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,gender,SeniorCitizen,Partner,Dependents,MultipleLines
2943,4526-EXKKN,2016-10-01,No,Two year,Yes,Mailed check,24.6,973.95,,,,,,,,Male,0,No,No,Yes
6818,2710-WYVXG,2019-11-01,No,Two year,No,Mailed check,71.1,213.35,DSL,Yes,Yes,No,Yes,No,Yes,Female,0,No,No,No
4096,0829-XXPLX,2018-06-01,No,Month-to-month,Yes,Bank transfer (automatic),89.4,1871.15,Fiber optic,Yes,No,Yes,No,Yes,No,Female,0,No,No,No


- Customers with null values in the service columns are understood to be customers who do not have those services. Ideally, these values should be filled with 'No'.
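A possible variation, sketched here on an invented frame, is to fill only the service columns instead of the whole table, so that a missing value appearing in any other column would not be silently turned into 'No'. The `service_cols` list is an assumption chosen for this example.

```python
import pandas as pd

# Hypothetical merged frame with gaps only in service columns.
merged = pd.DataFrame({
    "customerID": ["A", "B"],
    "InternetService": [None, "DSL"],
    "StreamingTV": [None, "Yes"],
    "MultipleLines": ["No", None],
})

# Columns where a missing value means "service not contracted" (assumed list).
service_cols = ["InternetService", "StreamingTV", "MultipleLines"]

# Work on a copy so the merged frame stays available for comparison.
filled = merged.copy()
filled[service_cols] = filled[service_cols].fillna("No")

print(filled)
```

On the notebook's data the result should match `merged_data.fillna('No')` as long as the service columns are the only ones with missing values.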

In [19]:
df_nonull = merged_data.fillna('No')
df_nonull.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  gender            7043 non-null   object 
 16  SeniorCitizen     7043 non-null   int64  


In [20]:
df_nonull.sample(5)

Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,gender,SeniorCitizen,Partner,Dependents,MultipleLines
6482,5419-JPRRN,2019-11-01,2019-12-01 00:00:00,Month-to-month,Yes,Electronic check,101.45,101.45,Fiber optic,No,No,Yes,No,Yes,Yes,Male,0,No,No,Yes
6044,1689-YQBYY,2019-02-01,No,Month-to-month,Yes,Electronic check,76.6,893.0,Fiber optic,No,No,No,No,No,No,Female,0,No,Yes,Yes
904,0379-DJQHR,2014-07-01,No,Two year,No,Credit card (automatic),81.35,5398.6,DSL,Yes,Yes,Yes,No,Yes,Yes,Male,0,Yes,Yes,No
1504,1769-GRUIK,2018-08-01,No,Month-to-month,Yes,Electronic check,71.1,1247.75,Fiber optic,No,No,No,No,No,No,Female,0,No,No,No
5083,7136-IHZJA,2016-10-01,No,Month-to-month,Yes,Mailed check,71.35,2847.2,DSL,Yes,Yes,No,Yes,No,Yes,Female,0,Yes,Yes,No
