## Data description

Data files:

   - `contract.csv`: contract information
   - `personal.csv`: customer information
   - `internet.csv`: internet service information
   - `phone.csv`: phone service information
    
Every file shares the `customerID` column, which can be used to merge the data into a single table.
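
A merge on a shared key behaves predictably only when the key is unique in each table. The sketch below shows one way to check that before merging. It uses two small made-up frames in place of the real files, and `key_is_unique` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for two of the files; the IDs and values are invented.
contract = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "Type": ["One year", "Two year", "Month-to-month"],
})
phone = pd.DataFrame({
    "customerID": ["A1", "C3"],
    "MultipleLines": ["Yes", "No"],
})


def key_is_unique(df, key="customerID"):
    """Return True when no value of `key` is repeated in `df`."""
    return not df[key].duplicated().any()


# One flag per table: True means the key can identify a single row
unique_flags = {
    name: key_is_unique(df)
    for name, df in [("contract", contract), ("phone", phone)]
}
print(unique_flags)
```

If any flag comes back `False`, the merged table would contain duplicated customers, so the duplicates should be examined first.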

In [1]:
import pandas as pd
import numpy as np
from functools import reduce

In [2]:
# Try the local working directory first; fall back to the platform path
try:
    df_contract = pd.read_csv('contract.csv')
    df_internet = pd.read_csv('internet.csv')
    df_personal = pd.read_csv('personal.csv')
    df_phone = pd.read_csv('phone.csv')
except FileNotFoundError:
    df_contract = pd.read_csv('/datasets/final_provider/contract.csv')
    df_internet = pd.read_csv('/datasets/final_provider/internet.csv')
    df_personal = pd.read_csv('/datasets/final_provider/personal.csv')
    df_phone = pd.read_csv('/datasets/final_provider/phone.csv')

In [3]:
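def inspect_dataframe(df, n_rows=3):
    """Print the DataFrame summary and show a random sample of rows."""
    # df.info() prints the summary itself and returns None, so print() also emits "None"
    print(df.info())
    display(df.sample(n_rows))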

In [4]:
dataframes = [
    ("Contract", df_contract),
    ("Internet", df_internet),
    ("Personal", df_personal),
    ("Phone", df_phone)
]

for df_name, df in dataframes:
    print(f"{'-' * 30}\n{df_name}\n{'-' * 30}")
    inspect_dataframe(df)
    print("\n")

------------------------------
Contract
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
dtypes: float64(1), object(7)
memory usage: 440.3+ KB
None


Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
4279,8815-LMFLX,2018-01-01,No,Month-to-month,Yes,Bank transfer (automatic),25.4,546.85
6941,2405-LBMUW,2015-01-01,No,One year,Yes,Bank transfer (automatic),50.7,3088.75
6530,0230-UBYPQ,2014-11-01,No,One year,No,Bank transfer (automatic),36.1,2298.9




------------------------------
Internet
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5517 entries, 0 to 5516
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   customerID        5517 non-null   object
 1   InternetService   5517 non-null   object
 2   OnlineSecurity    5517 non-null   object
 3   OnlineBackup      5517 non-null   object
 4   DeviceProtection  5517 non-null   object
 5   TechSupport       5517 non-null   object
 6   StreamingTV       5517 non-null   object
 7   StreamingMovies   5517 non-null   object
dtypes: object(8)
memory usage: 344.9+ KB
None


Unnamed: 0,customerID,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies
3787,8714-EUHJO,Fiber optic,No,Yes,No,No,No,Yes
5261,4011-ARPHK,DSL,Yes,No,No,No,No,No
333,5973-EJGDP,Fiber optic,No,Yes,Yes,Yes,No,No




------------------------------
Personal
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customerID     7043 non-null   object
 1   gender         7043 non-null   object
 2   SeniorCitizen  7043 non-null   int64 
 3   Partner        7043 non-null   object
 4   Dependents     7043 non-null   object
dtypes: int64(1), object(4)
memory usage: 275.2+ KB
None


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents
2961,6898-MDLZW,Male,0,No,No
5892,2709-UQGNP,Male,0,No,No
2964,9357-UJRUN,Male,0,Yes,No




------------------------------
Phone
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6361 entries, 0 to 6360
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customerID     6361 non-null   object
 1   MultipleLines  6361 non-null   object
dtypes: object(2)
memory usage: 99.5+ KB
None


Unnamed: 0,customerID,MultipleLines
3052,8659-HDIYE,Yes
3736,7247-XOZPB,Yes
1520,9659-QEQSY,Yes






- None of the four tables has null values: every column's non-null count matches the number of rows.
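
The `info()` output only counts nulls. Columns stored as `object`, such as `TotalCharges`, could in principle still hold whitespace-only strings. The sketch below counts both cases on a toy frame. The blank value is invented for illustration, and `count_missing` is a hypothetical helper; whether the real files contain such blanks is not established here.

```python
import pandas as pd

# Toy frame; the blank TotalCharges entry is made up to illustrate the check.
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "TotalCharges": ["546.85", " ", "3088.75"],
})


def count_missing(df):
    """Count nulls and whitespace-only strings for every column."""
    nulls = df.isna().sum()
    # Cast to str so the same check works for any dtype; NaN becomes "nan", not blank
    blanks = df.apply(lambda col: col.astype(str).str.strip().eq("").sum())
    return pd.DataFrame({"nulls": nulls, "blanks": blanks})


report = count_missing(toy)
print(report)
```

Running the same helper on each of the four tables would confirm the observation above and show whether any blanks need handling before type conversion.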

In [5]:
# Build a list with all the DataFrames
dataframes_list = [df_contract, df_internet, df_personal, df_phone]

# Every table has 'customerID'; an outer join on it combines all the tables
merged_data = reduce(lambda left, right: pd.merge(left, right, on=['customerID'], how='outer'), dataframes_list)
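
To see why the outer join introduces nulls, here is a minimal sketch of the same `reduce` pattern on three made-up frames. The data and the `merge_on_key` helper are assumptions for illustration. The `validate="one_to_one"` argument is an optional safeguard added here, not something the notebook uses.

```python
from functools import reduce

import pandas as pd

# Toy tables: B2 has no internet service, A1 has no phone service (invented data)
contract = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "MonthlyCharges": [20.0, 55.5, 80.1],
})
internet = pd.DataFrame({
    "customerID": ["A1", "C3"],
    "InternetService": ["DSL", "Fiber optic"],
})
phone = pd.DataFrame({
    "customerID": ["B2", "C3"],
    "MultipleLines": ["No", "Yes"],
})


def merge_on_key(frames, key="customerID"):
    """Outer-join every frame on `key`, failing if the key repeats in a table."""
    return reduce(
        # validate="one_to_one" raises MergeError when the key is duplicated
        lambda left, right: pd.merge(
            left, right, on=key, how="outer", validate="one_to_one"
        ),
        frames,
    )


toy_merged = merge_on_key([contract, internet, phone])
missing_per_column = toy_merged.isna().sum()
print(toy_merged)
print(missing_per_column)
```

Every customer is kept, and a customer missing from one of the tables gets a null in that table's columns.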

In [6]:
inspect_dataframe(merged_data)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
 8   InternetService   5517 non-null   object 
 9   OnlineSecurity    5517 non-null   object 
 10  OnlineBackup      5517 non-null   object 
 11  DeviceProtection  5517 non-null   object 
 12  TechSupport       5517 non-null   object 
 13  StreamingTV       5517 non-null   object 
 14  StreamingMovies   5517 non-null   object 
 15  gender            7043 non-null   object 
 16  SeniorCitizen     7043 non-null   int64  


Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,gender,SeniorCitizen,Partner,Dependents,MultipleLines
5006,9103-TCIHJ,2018-07-01,2019-10-01 00:00:00,Month-to-month,Yes,Mailed check,55.7,899.8,DSL,Yes,No,No,No,No,No,Female,0,No,No,Yes
1363,3084-DOWLE,2014-02-01,No,Two year,No,Bank transfer (automatic),92.0,6474.4,DSL,Yes,Yes,Yes,Yes,Yes,Yes,Female,0,Yes,No,Yes
5891,4905-JEFDW,2019-02-01,2020-01-01 00:00:00,One year,Yes,Electronic check,41.6,470.6,DSL,No,No,Yes,No,Yes,No,Male,0,No,No,


- The nulls in the service columns are understood to belong to customers who do not have those services, since they are absent from `internet.csv` or `phone.csv`. Ideally they should be filled with 'No'.
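
A global `fillna('No')` gives the intended result here because only the service columns contain nulls. As a more defensive variant, the fill could be limited to those columns so that an unexpected gap elsewhere stays visible. The sketch below assumes this approach. `SERVICE_COLUMNS` and `fill_missing_services` are hypothetical names, and the toy data, including the missing `MonthlyCharges` value, is invented.

```python
import numpy as np
import pandas as pd

# Service columns coming from internet.csv and phone.csv, as listed by info()
SERVICE_COLUMNS = [
    "InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection",
    "TechSupport", "StreamingTV", "StreamingMovies", "MultipleLines",
]


def fill_missing_services(df, columns=SERVICE_COLUMNS, value="No"):
    """Return a copy of `df` with nulls replaced only in the service columns."""
    present = [col for col in columns if col in df.columns]
    filled = df.copy()
    filled[present] = filled[present].fillna(value)
    return filled


toy = pd.DataFrame({
    "customerID": ["A1", "B2"],
    "MonthlyCharges": [20.25, np.nan],   # invented gap outside the service columns
    "InternetService": [np.nan, "DSL"],
    "MultipleLines": ["Yes", np.nan],
})
toy_filled = fill_missing_services(toy)
print(toy_filled)
```

The service gaps become 'No', while the missing `MonthlyCharges` value remains null and can be investigated separately.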

In [7]:
df_nonull = merged_data.fillna('No')
df_nonull.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  gender            7043 non-null   object 
 16  SeniorCitizen     7043 non-null   int64  


In [8]:
df_nonull.sample(5)

Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,gender,SeniorCitizen,Partner,Dependents,MultipleLines
6785,3090-HAWSU,2014-10-01,2019-11-01 00:00:00,Two year,Yes,Credit card (automatic),111.6,6876.05,Fiber optic,Yes,No,Yes,Yes,Yes,Yes,Male,0,No,No,Yes
2357,9251-WNSOD,2014-07-01,No,One year,No,Mailed check,75.1,5064.45,DSL,Yes,Yes,Yes,No,No,Yes,Female,0,Yes,No,Yes
2174,1178-PZGAB,2018-07-01,No,One year,No,Credit card (automatic),20.25,383.65,No,No,No,No,No,No,No,Female,0,No,No,No
4621,0311-QYWSS,2019-08-01,No,Month-to-month,Yes,Electronic check,49.45,314.6,DSL,Yes,No,No,No,No,No,Female,0,No,No,No
3165,1834-WULEG,2018-02-01,No,One year,No,Mailed check,20.25,439.75,No,No,No,No,No,No,No,Male,0,Yes,Yes,No
