# Task
3. Read the data from the CSV file into a pandas dataframe.
4. Remove columns that may be irrelevant for churn prediction. Remember that too many columns in kNN may reduce accuracy.
5. If there are missing values in some data points, remove them from the data set.
6. Convert the data to a format usable for scikit-learn.
7. Run a kNN algorithm on the data.
8. Find the performance of your model in terms of accuracy.
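
Steps 6 to 8 can be sketched roughly as follows. This is a minimal sketch on a small synthetic frame, not the Telco data: the label rule, the 80/20 split, `n_neighbors=5`, and the use of `StandardScaler` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an already-encoded, all-numeric dataframe.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "Contract": rng.integers(0, 3, size=n),
    "MonthlyCharges": rng.uniform(20, 120, size=n),
})
# Made-up label rule so that the features carry some signal.
toy["Churn"] = ((toy["Contract"] == 0) & (toy["MonthlyCharges"] > 60)).astype(int)

# Step 6: split into a feature matrix X and a target vector y.
X = toy.drop(columns="Churn")
y = toy["Churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 7: kNN is distance-based, so features are standardized before fitting.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

# Step 8: accuracy is the fraction of correct predictions on the held-out rows.
accuracy = model.score(X_test, y_test)
print(f"accuracy: {accuracy:.3f}")
```

Scaling matters here because a column such as `MonthlyCharges` spans a much larger range than a 0/1 column and would otherwise dominate the distance computation.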

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("dataset/WA_Fn-UseC_-Telco-Customer-Churn.csv")

## What we know about the data:
"Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs." [IBM Sample Data Sets]

The dataset includes information about:
- Customers who left within the last month - the column is called Churn
- Services that each customer has signed up for: phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers – gender, age range, and if they have partners and dependents

In [2]:
print(df.shape)

(7043, 21)


In [3]:
df.keys()

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

### Remove columns that are irrelevant to churn prediction:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


Columns to drop:

- SeniorCitizen *
- Partner
- tenure
- PhoneService
- MultipleLines
- InternetService
- OnlineSecurity
- OnlineBackup
- DeviceProtection
- PaperlessBilling
- PaymentMethod

In [5]:
new_df = df.drop(columns = [
    "SeniorCitizen", "Partner", "tenure", "PhoneService",
    "MultipleLines", "InternetService", "OnlineSecurity",
    "OnlineBackup", "DeviceProtection", "PaperlessBilling",
    "PaymentMethod", "customerID"
])
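
One way to sanity-check which columns to drop is to compare the churn rate across the values of each candidate column. The sketch below uses a tiny made-up frame that reuses the dataset's column names; the values are illustrative only, and the helper `churn_rate_by` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Tiny made-up sample (not taken from the real CSV).
sample = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year",
                 "Month-to-month", "Two year"],
    "gender": ["Female", "Male", "Male", "Female", "Male", "Female"],
    "Churn": ["Yes", "Yes", "No", "No", "No", "No"],
})

def churn_rate_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Return the share of churned customers for each value of `column`."""
    churned = frame["Churn"].eq("Yes")          # boolean Series: True if churned
    return churned.groupby(frame[column]).mean()

rates = churn_rate_by(sample, "Contract")
print(rates)
```

A column whose values all show a similar churn rate probably carries little signal on its own, while large differences between values suggest the column may be worth keeping.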

In [6]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   gender           7043 non-null   object 
 1   Dependents       7043 non-null   object 
 2   TechSupport      7043 non-null   object 
 3   StreamingTV      7043 non-null   object 
 4   StreamingMovies  7043 non-null   object 
 5   Contract         7043 non-null   object 
 6   MonthlyCharges   7043 non-null   float64
 7   TotalCharges     7043 non-null   object 
 8   Churn            7043 non-null   object 
dtypes: float64(1), object(8)
memory usage: 495.3+ KB


### Checking for missing values:

In [19]:
new_df.isnull().values.any()

False
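
`isnull()` only detects real `NaN`/`None` values, so a column read as `object` can still hide blanks stored as text. A minimal sketch of a check for that, using a made-up Series in place of `TotalCharges`:

```python
import pandas as pd

# Made-up values; the " " entry mimics a blank cell read from a CSV as text.
charges = pd.Series(["29.85", " ", "108.15"], name="TotalCharges")

has_nan = charges.isnull().any()        # False: " " is a string, not NaN
is_blank = charges.str.strip().eq("")   # True where the cell holds only whitespace
print(has_nan, int(is_blank.sum()))
```

Running the same kind of check on every `object` column is a cheap way to find missing values that `isnull()` does not report.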

In [8]:
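new_df.info  # without parentheses this displays the bound method (and the frame's repr), not the info() summary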

<bound method DataFrame.info of       gender Dependents TechSupport StreamingTV StreamingMovies  \
0     Female         No          No          No              No   
1       Male         No          No          No              No   
2       Male         No          No          No              No   
3       Male         No         Yes          No              No   
4     Female         No          No          No              No   
...      ...        ...         ...         ...             ...   
7038    Male        Yes         Yes         Yes             Yes   
7039  Female        Yes          No         Yes             Yes   
7040  Female        Yes          No          No              No   
7041    Male         No          No          No              No   
7042    Male         No         Yes         Yes             Yes   

            Contract  MonthlyCharges TotalCharges Churn  
0     Month-to-month           29.85        29.85    No  
1           One year           56.95       1889

In [9]:
# Encode the categorical columns as integers so scikit-learn can work with them.
fix_gender = {"Female": 0, "Male": 1}
fix_yesno = {"No": 0, "Yes": 1, "No internet service": 0}
fix_contract = {"Month-to-month": 0, "One year": 1, "Two year": 2}

new_df["gender"] = new_df["gender"].map(fix_gender)
new_df["Contract"] = new_df["Contract"].map(fix_contract)

# The remaining yes/no style columns all share the same mapping.
for col in ["Dependents", "TechSupport", "StreamingTV", "StreamingMovies", "Churn"]:
    new_df[col] = new_df[col].map(fix_yesno)

# Alternatives to look into: factorize, to_numeric
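
The closing comment in the cell above mentions `factorize` as an alternative to hand-written dictionaries. As a hedged sketch on made-up data: `pd.factorize` assigns integer codes in order of first appearance, and `pd.get_dummies` one-hot encodes a column, which avoids implying an order between categories.

```python
import pandas as pd

contract = pd.Series(
    ["Month-to-month", "Two year", "One year", "Month-to-month"],
    name="Contract",
)

# Integer codes in order of first appearance, plus the matching category labels.
codes, labels = pd.factorize(contract)
print(codes.tolist())   # [0, 1, 2, 0]
print(list(labels))     # ['Month-to-month', 'Two year', 'One year']

# One column per category, with a 0/1 indicator in each.
one_hot = pd.get_dummies(contract, dtype=int)
print(one_hot.columns.tolist())
```

Note that the `factorize` codes depend on row order, so they need not match a manual mapping such as `fix_contract`; here "Two year" receives code 1 only because it appears second.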


In [28]:
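# Blank strings in TotalCharges mark missing values: turn them into NaN and drop those rows.
new_df["TotalCharges"] = new_df["TotalCharges"].replace([" ", ""], np.nan)
new_df = new_df.dropna(subset=["TotalCharges"])

#pd.to_numeric(new_df["TotalCharges"], downcast="float")
# Note: convert_dtypes() returns a new Series (string dtype here). The result is
# not assigned, so the column in new_df itself is left unchanged.
new_df["TotalCharges"].convert_dtypes()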

0         29.85
1        1889.5
2        108.15
3       1840.75
4        151.65
         ...   
7038     1990.5
7039     7362.9
7040     346.45
7041      306.6
7042     6844.5
Name: TotalCharges, Length: 7032, dtype: string
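
The output above still has a text dtype, so the column is not yet numeric. One way to get floats is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries such as blanks into `NaN` in the same step. A minimal sketch on made-up values, assigning the result back to the column:

```python
import pandas as pd

frame = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

# Parse text to floats; anything that cannot be parsed (the blank) becomes NaN.
frame["TotalCharges"] = pd.to_numeric(frame["TotalCharges"], errors="coerce")

# Drop the rows that had no usable value.
frame = frame.dropna(subset=["TotalCharges"])
print(frame["TotalCharges"].dtype, len(frame))
```

The assignment is the important part: conversion methods in pandas generally return a new object, so the result has to be stored for the dataframe to change.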

In [29]:
print(new_df)
new_df.info()


      gender  Dependents  TechSupport  StreamingTV  StreamingMovies  Contract  \
0          0           0            0            0                0         0   
1          1           0            0            0                0         1   
2          1           0            0            0                0         0   
3          1           0            1            0                0         1   
4          0           0            0            0                0         0   
...      ...         ...          ...          ...              ...       ...   
7038       1           1            1            1                1         1   
7039       0           1            0            1                1         1   
7040       0           1            0            0                0         0   
7041       1           0            0            0                0         0   
7042       1           0            1            1                1         2   

      MonthlyCharges TotalC