<a href="https://colab.research.google.com/github/salomon-alvarez/churn_prediction_project/blob/main/churn_prediction_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import files
import pandas as pd

# Upload file
uploaded = files.upload()

# Read the uploaded file into pandas (the filename must match the uploaded file)
df = pd.read_csv("Churn_Modelling.csv")

df.head()

Saving Churn_Modelling.csv to Churn_Modelling.csv


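|   | CustomerId | CredRate | Geography | Gender | Age | Tenure | Balance | Prod Number | HasCrCard | ActMem | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15634602 | 619 | France | Female | 42.0 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 15647311 | 608 | Spain | Female | 41.0 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 15619304 | 502 | France | Female | 42.0 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 15701354 | 699 | France | Female | 39.0 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 15737888 | 850 | Spain | Female | 43.0 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |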


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CustomerId       10000 non-null  int64  
 1   CredRate         10000 non-null  int64  
 2   Geography        10000 non-null  object 
 3   Gender           9996 non-null   object 
 4   Age              9994 non-null   float64
 5   Tenure           10000 non-null  int64  
 6   Balance          10000 non-null  float64
 7   Prod Number      10000 non-null  int64  
 8   HasCrCard        10000 non-null  int64  
 9   ActMem           10000 non-null  int64  
 10  EstimatedSalary  9996 non-null   float64
 11  Exited           10000 non-null  int64  
dtypes: float64(3), int64(7), object(2)
memory usage: 937.6+ KB


## Data Cleaning
### Handling missing values

In [4]:
df.isnull().any()

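| Column | Has nulls |
|---|---|
| CustomerId | False |
| CredRate | False |
| Geography | False |
| Gender | True |
| Age | True |
| Tenure | False |
| Balance | False |
| Prod Number | False |
| HasCrCard | False |
| ActMem | False |
| EstimatedSalary | True |
| Exited | False |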
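
`.any()` only reports whether a column contains nulls at all. A minimal sketch of how per-column counts and percentages could be obtained instead, using a small made-up frame (`toy`, `summary`, and all values in them are illustrative and not taken from the churn data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the churn data (values are illustrative only)
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Age": [42.0, 41.0, np.nan, np.nan],
    "Tenure": [2, 1, 8, 1],
})

# Number of missing values per column
missing_counts = toy.isna().sum()

# Share of missing values per column, in percent
missing_pct = toy.isna().mean() * 100

# Combine both views and keep only the columns that actually have gaps
summary = pd.DataFrame({"missing": missing_counts, "percent": missing_pct})
summary = summary[summary["missing"] > 0]
print(summary)
```

Knowing how many values are missing, rather than only whether any are, helps judge how much an imputation choice can affect the column.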


In [6]:
df.describe()

|   | CustomerId | CredRate | Age | Tenure | Balance | Prod Number | HasCrCard | ActMem | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 9994.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 9996.0 | 10000.0 |
| mean | 15690940.0 | 650.5288 | 38.925255 | 5.0128 | 76485.889288 | 1.5302 | 0.7055 | 0.5151 | 100074.744083 | 0.2037 |
| std | 71936.19 | 96.653299 | 10.489248 | 2.892174 | 62397.405202 | 0.581654 | 0.45584 | 0.499797 | 57515.774555 | 0.402769 |
| min | 15565700.0 | 350.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.58 | 0.0 |
| 25% | 15628530.0 | 584.0 | 32.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 50974.0775 | 0.0 |
| 50% | 15690740.0 | 652.0 | 37.0 | 5.0 | 97198.54 | 1.0 | 1.0 | 1.0 | 100168.24 | 0.0 |
| 75% | 15753230.0 | 718.0 | 44.0 | 7.0 | 127644.24 | 2.0 | 1.0 | 1.0 | 149388.2475 | 0.0 |
| max | 15815690.0 | 850.0 | 92.0 | 10.0 | 250898.09 | 4.0 | 1.0 | 1.0 | 199992.48 | 1.0 |


In [7]:
# Count how often each value occurs in the "Gender" column (nulls are excluded)
df["Gender"].value_counts()

| Gender | count |
|---|---|
| Male | 5453 |
| Female | 4543 |


In [8]:
# Assign the result back to the column; df["Gender"].fillna(..., inplace=True)
# operates on a copy and raises a FutureWarning ahead of pandas 3.0
df["Gender"] = df["Gender"].fillna("Male")


In [10]:
df["Gender"].isna().sum()

np.int64(0)

### Handling Missing Values in Gender Column

The `Gender` column had 4 missing values (9,996 non-null out of 10,000). Since **Male** is the most frequent value (5,453 versus 4,543 for Female), I filled the nulls with `"Male"` (mode imputation). With so few rows affected, the impact on the gender distribution is negligible.
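
The fill value does not have to be hardcoded. A minimal sketch, assuming a toy frame (`toy` and its values are made up for illustration), of deriving the most frequent value with `mode()` and assigning the filled column back:

```python
import pandas as pd

# Toy column with two missing entries (illustrative values only)
toy = pd.DataFrame({"Gender": ["Male", "Female", None, "Male", None]})

# mode() returns a Series of the most frequent value(s), ignoring nulls;
# take the first entry in case of ties
most_common = toy["Gender"].mode()[0]

# Assign back instead of calling fillna(..., inplace=True) on a column selection
toy["Gender"] = toy["Gender"].fillna(most_common)

print(most_common)
print(toy["Gender"].isna().sum())
```

Deriving the value from the data keeps the cell correct if the file is refreshed and the majority class changes.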

In [11]:
# Calculate the mean salary (excluding nulls)
mean_salary = df["EstimatedSalary"].mean()

# Replace nulls with the mean
df["EstimatedSalary"] = df["EstimatedSalary"].fillna(mean_salary)

df["EstimatedSalary"].isna().sum()

np.int64(0)

### Handling Missing Values in EstimatedSalary

The `EstimatedSalary` column had 4 missing values. I replaced them with the **mean salary** of the column (about 100,074.74), which leaves the column mean unchanged. With only 4 of 10,000 rows affected, the effect on the spread of the column is negligible.
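
Mean imputation keeps the column mean but pulls the standard deviation down, which matters more as the share of missing values grows. A small sketch on made-up numbers (the `salary` series is hypothetical, not the churn data) to make that visible:

```python
import numpy as np
import pandas as pd

# Toy salary column with two missing entries (illustrative values only)
salary = pd.Series([40_000.0, 60_000.0, np.nan, 100_000.0, np.nan])

# mean() and std() skip nulls by default
mean_before = salary.mean()
std_before = salary.std()

# Fill the gaps with the mean of the observed values
filled = salary.fillna(mean_before)

print("mean before/after:", mean_before, filled.mean())
print("std before/after: ", std_before, filled.std())
```

In this toy case 2 of 5 values are imputed, so the drop in spread is large; with 4 of 10,000 rows it would be barely measurable.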

In [13]:
import math

# Calculate mean and round up
mean_age = math.ceil(df["Age"].mean())

# Replace nulls with the rounded mean
df["Age"] = df["Age"].fillna(mean_age)

# Check if there are any nulls left
print("Nulls in Age column:", df["Age"].isna().sum())

Nulls in Age column: 0


### Handling Missing Values in Age

The `Age` column had 6 missing values. Since the mean age (about 38.93) is not a whole number, I rounded it **up to the nearest whole number** (39) and used that value to fill the nulls. This keeps the filled values realistic, as ages are whole numbers.
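
After filling, `Age` still has dtype `float64`. If whole-number ages are wanted in the dtype as well, the column could be cast once no nulls remain. A minimal sketch on a made-up series (`age` and its values are illustrative; the cast is an optional extra step, not something the notebook does):

```python
import math

import numpy as np
import pandas as pd

# Toy age column with one missing entry (illustrative values only)
age = pd.Series([42.0, 41.0, np.nan, 39.0, 43.0])

# Mean of the observed values is 41.25; round up to 42
fill_value = math.ceil(age.mean())

# Fill the gap, then cast to integer (only safe once no nulls are left)
age_filled = age.fillna(fill_value).astype("int64")

print(fill_value)
print(age_filled.dtype)
```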

### Changing Column Names

In [17]:
# Rename columns
df.rename(columns={
    "CredRate": "Credit_Score",
    "ActMem": "Is_Active_Member",
    "Prod Number": "Num_of_Products",
    "Exited": "Churn",
    "HasCrCard": "Has_Credit_Card"
}, inplace=True)

# Check if changes worked
print(df.columns)

Index(['CustomerId', 'Credit_Score', 'Geography', 'Gender', 'Age', 'Tenure',
       'Balance', 'Num_of_Products', 'Has_Credit_Card', 'Is_Active_Member',
       'EstimatedSalary', 'Churn'],
      dtype='object')


### Changing Data Type

In [18]:
# List of categorical columns
categorical_cols = ["Gender", "Geography", "Is_Active_Member", "Churn", "Has_Credit_Card"]

# Convert to category dtype
for col in categorical_cols:
    df[col] = df[col].astype("category")

# Check dtypes
print(df.dtypes)

CustomerId             int64
Credit_Score           int64
Geography           category
Gender              category
Age                  float64
Tenure                 int64
Balance              float64
Num_of_Products        int64
Has_Credit_Card     category
Is_Active_Member    category
EstimatedSalary      float64
Churn               category
dtype: object
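
The loop above can also be written as a single `astype` call with a column-to-dtype mapping. A minimal sketch on a made-up frame (`toy` and its values are illustrative only) that also shows how to inspect the resulting categories and their integer codes:

```python
import pandas as pd

# Toy frame with one text column and one 0/1 flag (illustrative values only)
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "France", "Spain"],
    "Churn": [1, 0, 1, 0],
})

# Convert several columns in one call by passing a column -> dtype mapping
toy = toy.astype({"Geography": "category", "Churn": "category"})

# Categories are the distinct values; codes are their integer positions
print(toy["Geography"].cat.categories.tolist())
print(toy["Churn"].cat.codes.tolist())
```

One design point to keep in mind: numeric aggregations such as `.mean()` are not available on a `category` column, so a 0/1 flag like `Churn` may need to be cast back with `astype(int)` before computing something like a churn rate.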
