# Data Preprocessing and Preparation for Clustering Analysis

This notebook walks through preprocessing the dataset for clustering analysis. We remove duplicate rows, handle missing values, review descriptive statistics, encode categorical variables, split the data, and scale the features.

## Step 1: Upload the File

You will be prompted to upload your dataset file.

In [1]:
from google.colab import files
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from scipy import stats

# Upload the file
uploaded = files.upload()

# Load the dataset
data = pd.read_csv(next(iter(uploaded.keys())))  # Automatically get the file name from uploaded files

Saving Dataset.csv to Dataset.csv


## Step 2: Check for Duplicates

In this step, we check for any duplicate rows in the dataset and remove them if they exist.

In [2]:
# Checking for duplicates in the dataset
duplicate_count = data.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

# Remove duplicates if any
data = data.drop_duplicates()

Number of duplicate rows: 103
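
To illustrate how `duplicated()` arrives at its count, here is a minimal sketch on a made-up frame (the values are hypothetical and not taken from the dataset). With the default `keep='first'`, the first occurrence of a repeated row is not flagged, so three identical rows count as two duplicates.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 2 and 3 are identical
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Male"],
    "tenure": [1, 34, 1, 1],
    "MonthlyCharges": [29.85, 56.95, 29.85, 29.85],
})

# keep='first' (the default) flags only the repeats, not the first occurrence
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the original index labels; reset them if later
# steps rely on a contiguous 0..n-1 index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduped.shape)
```

Resetting the index is optional. The notebook does not do it, and nothing in the later steps depends on the index labels.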


## Step 3: Check for Missing Values and Handle Them

Next, we count the missing values in each column. Missing entries in the numerical columns `tenure` and `MonthlyCharges` are filled with the column mean. As the output below shows, this dataset has no missing values, so the fill leaves the data unchanged.

In [3]:
# Check for missing values
missing_values = data.isnull().sum()
print("Missing Values in each column:", missing_values)

# Fill missing numerical values with the column mean (skip columns that are absent)
for column in ['MonthlyCharges', 'tenure']:
    if column in data.columns:
        data[column] = data[column].fillna(data[column].mean())

Missing Values in each column: gender             0
SeniorCitizen      0
Dependents         0
tenure             0
PhoneService       0
MultipleLines      0
InternetService    0
Contract           0
MonthlyCharges     0
Churn              0
dtype: int64
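
Because this dataset has no gaps, the effect of mean imputation is easier to see on a toy column. The sketch below uses hypothetical values and also shows median imputation, a common alternative when a column contains outliers. The median variant is an illustration only and is not used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one large outlier
charges = pd.Series([20.0, 30.0, np.nan, 40.0, 110.0], name="MonthlyCharges")

# mean() and median() skip NaN by default
mean_filled = charges.fillna(charges.mean())      # fills the gap with 50.0
median_filled = charges.fillna(charges.median())  # fills the gap with 35.0

print(mean_filled.tolist())
print(median_filled.tolist())
```

The outlier pulls the mean up to 50.0, while the median stays at 35.0.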


## Step 4: Descriptive Statistics

We calculate the mean, median, mode, and standard deviation for numerical columns to understand the distribution and spread of the data. Additionally, we check for skewness to assess the symmetry of the data distribution.

In [5]:
# Calculate descriptive statistics
numeric_data = data.select_dtypes(include=['float64', 'int64'])  # Select only numeric columns

mean_values = numeric_data.mean()
median_values = numeric_data.median()
mode_values = numeric_data.mode().iloc[0]
std_dev_values = numeric_data.std()
skewness_values = numeric_data.skew()

# Display the descriptive statistics
print("Mean Values:\n", mean_values)
print("Median Values:\n", median_values)
print("Mode Values:\n", mode_values)
print("Standard Deviation:\n", std_dev_values)
print("Skewness Values:\n", skewness_values)


Mean Values:
 SeniorCitizen      0.163977
tenure            32.612392
MonthlyCharges    65.211081
dtype: float64
Median Values:
 SeniorCitizen      0.0
tenure            29.0
MonthlyCharges    70.6
dtype: float64
Mode Values:
 SeniorCitizen      0.00
tenure             1.00
MonthlyCharges    20.05
Name: 0, dtype: float64
Standard Deviation:
 SeniorCitizen      0.370281
tenure            24.461349
MonthlyCharges    29.904985
dtype: float64
Skewness Values:
 SeniorCitizen     1.815484
tenure            0.228468
MonthlyCharges   -0.239939
dtype: float64
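
In the output above, `tenure` (0.23) and `MonthlyCharges` (-0.24) are close to symmetric, while `SeniorCitizen` (1.82) is strongly right-skewed, as expected for a 0/1 column whose mean is about 0.16. As a rough check on what `DataFrame.skew()` computes, the sketch below compares it with SciPy's bias-corrected estimator on a hypothetical sample; the sample values are made up for illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical right-skewed sample (not taken from the dataset)
values = pd.Series([1, 1, 2, 2, 3, 4, 10], dtype="float64")

# pandas reports the bias-corrected sample skewness
pandas_skew = values.skew()

# SciPy applies the same correction when bias=False
scipy_skew = stats.skew(values.to_numpy(), bias=False)

print(round(pandas_skew, 4), round(scipy_skew, 4))
```

A positive value indicates a longer right tail, and a negative value a longer left tail.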


## Step 5: Encoding Categorical Variables

We encode the categorical variables using one-hot encoding, converting them into numeric format suitable for machine learning algorithms.

In [6]:
# Convert categorical variables to numeric using one-hot encoding
encoded_data = pd.get_dummies(data, drop_first=True)

# Display the first few rows of the encoded data
encoded_data.head()

| | SeniorCitizen | tenure | MonthlyCharges | gender_Male | Dependents_Yes | PhoneService_Yes | MultipleLines_Yes | InternetService_Fiber optic | Contract_One year | Contract_Two year | Churn_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 29.85 | False | False | False | False | False | False | False | False |
| 1 | 0 | 34 | 56.95 | True | False | True | False | False | True | False | False |
| 2 | 0 | 2 | 53.85 | True | False | True | False | False | False | False | True |
| 3 | 0 | 45 | 42.3 | True | False | False | False | False | True | False | False |
| 4 | 0 | 2 | 70.7 | False | False | True | False | True | False | False | True |
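
The effect of `drop_first=True` can be seen on a small example. In this sketch (toy values, assumed for illustration) a three-level `Contract` column produces only two indicator columns, because the first level in sorted order becomes the implicit baseline. The `dtype=int` argument is an optional variation that yields 0/1 values instead of the booleans shown above.

```python
import pandas as pd

# Hypothetical toy frame with one three-level categorical column
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 34, 45, 12],
})

# drop_first=True drops the first level in sorted order ("Month-to-month"),
# which becomes the baseline; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

A row with zeros in both `Contract_One year` and `Contract_Two year` therefore represents a month-to-month contract.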


## Step 6: Save Preprocessed Dataset

The preprocessed dataset is saved before we proceed with splitting it into training and testing sets.

In [7]:
# Save the preprocessed dataset before splitting
encoded_data.to_csv('preprocessed_data.csv', index=False)
files.download('preprocessed_data.csv')

## Step 7: Splitting the Data

We split the preprocessed dataset into training and testing sets. The training set is used to train the model, and the testing set is used to evaluate its performance.

In [8]:
# Define features (X) and target (y)
X = encoded_data.drop('Churn_Yes', axis=1)  # Replace 'Churn_Yes' with your target variable column
y = encoded_data['Churn_Yes']  # Replace 'Churn_Yes' with your target variable column

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the training and testing sets
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((5552, 10), (1388, 10), (5552,), (1388,))
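
The shapes confirm the 80/20 split: 5552 + 1388 = 6940 rows, of which 1388 (20%) are held out. The split above is purely random. If the target classes are imbalanced, one optional variation is to pass `stratify`, which keeps the class proportions equal in both subsets. The sketch below demonstrates this on a hypothetical target; the `_toy` names are made up so they do not overwrite the notebook's variables.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 negatives and 20 positives
X_toy = pd.DataFrame({"tenure": range(100)})
y_toy = pd.Series([False] * 80 + [True] * 20, name="Churn_Yes")

# stratify=y_toy keeps the share of positives the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

Both subsets end up with 20% positives, matching the full toy target.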

## Step 8: Feature Scaling

Feature scaling is performed using StandardScaler to standardize the features by removing the mean and scaling to unit variance.

In [9]:
# Initialize the scaler for feature scaling
scaler = StandardScaler()

# Fit the scaler on the training data and transform both the training and testing data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert the scaled data back to a DataFrame for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

# Display the first few rows of the scaled training data
X_train_scaled.head()

| | SeniorCitizen | tenure | MonthlyCharges | gender_Male | Dependents_Yes | PhoneService_Yes | MultipleLines_Yes | InternetService_Fiber optic | Contract_One year | Contract_Two year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.440138 | -0.921304 | 0.354467 | -0.994253 | -0.65898 | 0.332933 | 1.153609 | 1.125632 | -0.513363 | -0.564315 |
| 1 | -0.440138 | 1.567905 | 0.407951 | -0.994253 | -0.65898 | 0.332933 | -0.866845 | -0.88839 | -0.513363 | 1.772059 |
| 2 | -0.440138 | -0.431624 | -1.333629 | 1.00578 | -0.65898 | 0.332933 | 1.153609 | -0.88839 | -0.513363 | 1.772059 |
| 3 | -0.440138 | -1.166144 | 0.947807 | -0.994253 | -0.65898 | 0.332933 | 1.153609 | 1.125632 | -0.513363 | -0.564315 |
| 4 | -0.440138 | -1.002918 | -1.504109 | 1.00578 | 1.517497 | 0.332933 | -0.866845 | -0.88839 | -0.513363 | 1.772059 |
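
To make the transformation concrete, the sketch below applies `StandardScaler` to a single hypothetical feature and reproduces the result by hand with z = (x - mean) / std. The numbers are made up for illustration. The test value is scaled with the mean and standard deviation learned from the training values, which is why the scaler is fitted on the training set only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test values for a single feature
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns mean and std from train
test_scaled = toy_scaler.transform(test)        # reuses the training statistics

# Same result by hand; StandardScaler uses the population std (ddof=0),
# which is also NumPy's default for std()
manual = (test - train.mean()) / train.std()

print(train_scaled.ravel())
print(test_scaled.ravel(), manual.ravel())
```

After scaling, the training column has mean 0 and unit variance. The test column generally does not, because it is scaled with the training statistics.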


## Step 9: Save Scaled Data

Finally, the scaled training and testing datasets, along with the target variables, are saved for future use.

In [11]:
# Save the training and testing datasets
X_train_scaled.to_csv('X_train_scaled.csv', index=False)
X_test_scaled.to_csv('X_test_scaled.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
y_test.to_csv('y_test.csv', index=False)

# Download the datasets
files.download('X_train_scaled.csv')
files.download('X_test_scaled.csv')
files.download('y_train.csv')
files.download('y_test.csv')
