# Feature Engineering
1. Dropping Irrelevant Columns: Use the DataFrame drop() method to remove columns that are irrelevant to the task, such as the customerID identifier. Add further column names to the list if other columns should be dropped.

2. Encoding Categorical Variables: Categorical variables are encoded with LabelEncoder from scikit-learn's preprocessing module, which converts each category to an integer code because many machine learning algorithms require numeric input. The codes follow the sorted order of the category labels, so their numeric order carries no meaning of its own. The cat_cols variable holds the names of the categorical columns; adjust it to match your dataset.

3. Feature Selection using Chi-Square Test: The SelectKBest class from scikit-learn's feature_selection module is used with the chi-square (chi2) score function to keep the k features with the highest chi-square scores against the target variable. The chi2 function requires non-negative feature values. Adjust the k parameter to set the number of features to select.

4. Printing Selected Features: Finally, the code prints the names of the selected features based on the chi-square test results.

In [2]:
import pandas as pd

In [3]:
# Load the dataset
df = pd.read_csv('/content/Telco-Customer-Churn.csv')
df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
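
Before encoding, it can be worth confirming that numeric columns actually loaded as numbers. In the widely distributed version of this dataset, TotalCharges contains a few blank strings and therefore loads as object; whether that applies to this file is not visible in the truncated output above. The sketch below uses a small made-up frame (toy is hypothetical, not the real data) to show one way to coerce such a column with pd.to_numeric:

```python
import pandas as pd

# Hypothetical miniature frame: one blank string hides among numeric strings
toy = pd.DataFrame({
    "tenure": [0, 1, 34],
    "TotalCharges": [" ", "29.85", "1889.5"],
})

# errors="coerce" turns anything unparseable (here the blank) into NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Count the values that could not be parsed so they can be handled explicitly
n_missing = int(toy["TotalCharges"].isna().sum())
print(toy.dtypes)
print("Unparseable TotalCharges values:", n_missing)
```

If such a column stays as object, the select_dtypes(include='object') step would label-encode it like a categorical and replace the charge amounts with category codes.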


In [5]:
# Drop irrelevant columns
df = df.drop(['customerID'], axis=1)

In [6]:
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import LabelEncoder

In [7]:
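# Encode categorical variables (every object-dtype column, including the target 'Churn')
cat_cols = df.select_dtypes(include='object').columns
# Fit a fresh LabelEncoder per column so no fitted state is shared between columns
df[cat_cols] = df[cat_cols].apply(lambda col: LabelEncoder().fit_transform(col))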
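
To see what the encoding step produces, here is a minimal sketch on a made-up two-column frame. The names toy and encoders are illustrative and not part of the notebook. Keeping one fitted encoder per column is an optional design choice that makes it possible to map the integer codes back to the original labels later:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample with two categorical columns
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "Churn": ["No", "No", "Yes", "No"],
})

# Keep a separate fitted encoder per column for later decoding
encoders = {}
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ lists the labels in sorted order; the position is the integer code
contract_classes = list(encoders["Contract"].classes_)
decoded = list(encoders["Contract"].inverse_transform([0, 1, 2]))
print(toy)
print(contract_classes)
```

Because the codes follow sorted label order, 'Two year' receiving a larger code than 'One year' is a coincidence of spelling rather than a statement about the contracts.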

In [8]:
# Separate features and target variable
X = df.drop('Churn', axis=1)
y = df['Churn']

In [9]:
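# Perform feature selection using chi-square test
# chi2 requires non-negative feature values
# Note: the selector is fitted on the full dataset, before the train/test split
selector = SelectKBest(score_func=chi2, k=5)
X_new = selector.fit_transform(X, y)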

In [10]:
# Get the selected feature names
selected_features = X.columns[selector.get_support(indices=True)].tolist()
print("Selected Features:")
print(selected_features)

Selected Features:
['tenure', 'OnlineSecurity', 'Contract', 'MonthlyCharges', 'TotalCharges']
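
The list above shows which features were kept but not how strongly each one scored. A fitted SelectKBest exposes scores_ and pvalues_, which can be tabulated per feature. The sketch below runs on synthetic data (X, y and the column names are made up for illustration), so its numbers say nothing about the Telco dataset:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Synthetic binary target and two non-negative features
y = rng.integers(0, 2, size=200)
X = pd.DataFrame({
    "informative": y * 5 + rng.integers(0, 3, size=200),  # depends on y
    "noise": rng.integers(0, 8, size=200),                # independent of y
})

selector = SelectKBest(score_func=chi2, k=1).fit(X, y)

# One row per feature, highest chi-square score first
scores = pd.DataFrame({
    "feature": X.columns,
    "chi2": selector.scores_,
    "p_value": selector.pvalues_,
}).sort_values("chi2", ascending=False)
print(scores)
```

One thing to keep in mind when reading such a table: scikit-learn's chi2 works on per-class sums of the feature values, so columns with large magnitudes can receive large scores partly because of their scale.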


In [11]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [12]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=42)
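
In the cells above, the selector is fitted on all rows before the split, so the test rows influence which features are chosen. One possible alternative, sketched here on synthetic data, is to split first and wrap selection and model in a scikit-learn Pipeline so the selector only sees the training rows. The names X_demo, y_demo and pipe, the stratify argument and max_iter=1000 are assumptions for this sketch, not settings taken from the notebook:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)

# Synthetic data: six non-negative integer features, only column 0 depends on the target
y_demo = rng.integers(0, 2, size=500)
X_demo = rng.integers(0, 5, size=(500, 6))
X_demo[:, 0] += 4 * y_demo

# Split first; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Selection and model are fitted together, on the training rows only
pipe = Pipeline([
    ("select", SelectKBest(score_func=chi2, k=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

test_acc = pipe.score(X_te, y_te)
print("Test accuracy:", test_acc)
```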

In [13]:
# Create a Logistic Regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

In [14]:
# Make predictions on the training set
y_train_pred = model.predict(X_train)

In [15]:
# Evaluate the model on the training set
accuracy_train = accuracy_score(y_train, y_train_pred)
precision_train = precision_score(y_train, y_train_pred)
recall_train = recall_score(y_train, y_train_pred)
f1_train = f1_score(y_train, y_train_pred)

print("Training Set Metrics:")
print("Accuracy:", accuracy_train)
print("Precision:", precision_train)
print("Recall:", recall_train)
print("F1-Score:", f1_train)

Training Set Metrics:
Accuracy: 0.7838125665601704
Precision: 0.6069230769230769
Recall: 0.5274064171122995
F1-Score: 0.5643776824034336


In [16]:
# Make predictions on the testing set
y_test_pred = model.predict(X_test)

In [17]:
# Evaluate the model on the testing set
accuracy_test = accuracy_score(y_test, y_test_pred)
precision_test = precision_score(y_test, y_test_pred)
recall_test = recall_score(y_test, y_test_pred)
f1_test = f1_score(y_test, y_test_pred)

print("\nTesting Set Metrics:")
print("Accuracy:", accuracy_test)
print("Precision:", precision_test)
print("Recall:", recall_test)
print("F1-Score:", f1_test)


Testing Set Metrics:
Accuracy: 0.7877927608232789
Precision: 0.6114457831325302
Recall: 0.5442359249329759
F1-Score: 0.575886524822695
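
The four test metrics are all functions of the same confusion-matrix counts. The counts below are not printed anywhere in the notebook; they are back-calculated from the reported values, assuming a test set of 1409 rows (20% of 7043, rounded up). Treat them as a consistency check rather than as notebook output:

```python
# Back-calculated counts (an assumption derived from the printed metrics)
tp, fp, fn, tn = 203, 129, 170, 907

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # share of predicted churners who actually churned
recall = tp / (tp + fn)      # share of actual churners the model caught
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
```

Read this way, the model flags 332 customers as churners and is right about 203 of them, while missing 170 of the 373 customers who actually churned.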


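With feature engineering applied, the earlier issue is resolved and the model now produces meaningful results: about 79% accuracy and an F1-score of about 0.58 on the test set, with training and test metrics close to each other.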