LOGISTIC REGRESSION AND k-NN LEARNING TASKS

1. Email Spam — Logistic Regression (binary)
Task: Fit is_spam ~ features with LogisticRegression (scaled). Report accuracy, precision, recall, F1, ROC-AUC, and a confusion matrix.
Columns: word_free, word_offer, word_click, num_links, num_caps, sender_reputation, is_spam

In [54]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score,recall_score,f1_score,roc_auc_score,confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df=pd.read_csv('email_spam.csv')

X=df.iloc[:,:-1]
y=df.iloc[:,-1]

X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.2,random_state=42)

scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)

model=LogisticRegression()
model.fit(X_train,y_train)

y_pred=model.predict(X_test)

print(f'Accuracy: {accuracy_score(y_test,y_pred)}')
print(f'Precision: {precision_score(y_test,y_pred)}')
print(f'Recall: {recall_score(y_test,y_pred)}')
print(f'f1: {f1_score(y_test,y_pred)}')
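#note: y_pred holds hard 0/1 labels, so this value equals (TPR + TNR) / 2; pass model.predict_proba(X_test)[:,1] instead for a threshold-free ROC-AUC
print(f'ROC-AUC: {roc_auc_score(y_test,y_pred)}')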
print(f'Confusion matrix: {confusion_matrix(y_test,y_pred)}')

Accuracy: 0.3
Precision: 0.18181818181818182
Recall: 0.09523809523809523
f1: 0.125
ROC-AUC: 0.3107769423558898
Confusion matrix: [[10  9]
 [19  2]]
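
The ROC-AUC printed above was computed from hard class labels, which is why it equals (2/21 + 10/19) / 2 ≈ 0.3108, the mean of recall and specificity read off the confusion matrix. The sketch below shows the difference between scoring labels and scoring predicted probabilities. It uses synthetic data from make_classification as a stand-in, since email_spam.csv is not available here; the six-feature shape is an assumption taken from the column list, and the numbers it prints say nothing about the spam data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for email_spam.csv (assumption: 6 numeric features, binary target)
X, y = make_classification(n_samples=200, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression().fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard 0/1 labels
y_score = model.predict_proba(X_test)[:, 1]  # estimated P(class 1)

auc_from_labels = roc_auc_score(y_test, y_pred)
auc_from_scores = roc_auc_score(y_test, y_score)

# With hard labels the ROC curve has a single operating point,
# so the area under it reduces to (TPR + TNR) / 2.
tpr = recall_score(y_test, y_pred, pos_label=1)
tnr = recall_score(y_test, y_pred, pos_label=0)
balanced = (tpr + tnr) / 2

print(f'AUC from labels: {auc_from_labels}')
print(f'(TPR + TNR) / 2: {balanced}')
print(f'AUC from probabilities: {auc_from_scores}')
```

The first two printed values agree; the third uses the full ranking of test samples and is usually the figure meant by "ROC-AUC".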


2. Customer Churn — Logistic Regression (binary)
Task: Fit churn ~ tenure_months + monthly_charges + support_tickets + is_premium + avg_usage_hours (scaled). Report metrics as above.

In [55]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score,roc_auc_score,confusion_matrix
from sklearn.model_selection import train_test_split

df=pd.read_csv('customer_churn.csv')

X=df.iloc[:,:-1]
y=df.iloc[:,-1]

X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.2,random_state=42)

scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)

model=LogisticRegression()
model.fit(X_train,y_train)

y_pred=model.predict(X_test)

print(f'Accuracy: {accuracy_score(y_test,y_pred)}')
print(f'Precision: {precision_score(y_test,y_pred)}')
print(f'Recall: {recall_score(y_test,y_pred)}')
print(f'f1: {f1_score(y_test,y_pred)}')
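#note: y_pred holds hard 0/1 labels, so this value equals (TPR + TNR) / 2; pass model.predict_proba(X_test)[:,1] instead for a threshold-free ROC-AUC
print(f'ROC-AUC: {roc_auc_score(y_test,y_pred)}')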
print(f'Confusion matrix: {confusion_matrix(y_test,y_pred)}')

Accuracy: 0.6
Precision: 0.6521739130434783
Recall: 0.6521739130434783
f1: 0.6521739130434783
ROC-AUC: 0.5907928388746804
Confusion matrix: [[ 9  8]
 [ 8 15]]
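
The task names its predictors explicitly, while the cell above selects them by position and so relies on churn being the last column. A minimal sketch of selecting by name, with a stratified split and the scaler wrapped in a pipeline, might look like the following. The DataFrame is filled with random values as a hypothetical stand-in for customer_churn.csv; only the column names come from the task statement.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for customer_churn.csv: random values, task column names
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    'tenure_months': rng.integers(1, 72, n),
    'monthly_charges': rng.uniform(20, 120, n),
    'support_tickets': rng.integers(0, 10, n),
    'is_premium': rng.integers(0, 2, n),
    'avg_usage_hours': rng.uniform(0, 12, n),
    'churn': rng.integers(0, 2, n),
})

features = ['tenure_months', 'monthly_charges', 'support_tickets',
            'is_premium', 'avg_usage_hours']
X = df[features]
y = df['churn']

# stratify=y keeps the churn rate roughly equal in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training data only, then the classifier
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)

# On standardized inputs, each coefficient is the change in log-odds of churn
# per one-standard-deviation increase in that feature
coef = pd.Series(clf[-1].coef_[0], index=features)
print(coef.sort_values())
```

Because the stand-in labels are random, the coefficients printed here carry no meaning; the point is the selection, splitting, and pipeline pattern.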


3. Disease Stage — Multiclass Logistic Regression
Task: Fit multinomial logistic regression for stage ∈ {0,1,2} using age, b1..b4. Report accuracy, macro-F1, weighted-F1, and a confusion matrix.

In [56]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,confusion_matrix,f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

df=pd.read_csv('disease_stage.csv')

X=df.iloc[:,:-1]
y=df.iloc[:,-1]

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)

model=LogisticRegression(max_iter=1000)
model.fit(X_train,y_train)

y_pred=model.predict(X_test)

#the average argument of f1_score controls how the per-class F1 scores are aggregated
#macro --> unweighted mean of the per-class F1 scores; every class counts equally, regardless of how many samples it has
#weighted --> mean of the per-class F1 scores, each weighted by its number of true samples (support)
print(f'Accuracy: {accuracy_score(y_test,y_pred)}')
print(f'Macro-f1: {f1_score(y_test,y_pred,average="macro")}')
print(f'Weighted-f1: {f1_score(y_test,y_pred,average="weighted")}')
print(f'Confusion matrix: {confusion_matrix(y_test,y_pred)}')

Accuracy: 0.25
Macro-f1: 0.23164874551971326
Weighted-f1: 0.25556182795698923
Confusion matrix: [[7 6 3]
 [3 1 7]
 [5 6 2]]
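
The macro and weighted averages can be checked by hand from the confusion matrix above. The sketch below recomputes accuracy, macro-F1, and weighted-F1 with NumPy only, using scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix printed above: rows = true stage, columns = predicted stage
cm = np.array([[7, 6, 3],
               [3, 1, 7],
               [5, 6, 2]])

tp = np.diag(cm)            # correctly classified samples per class
support = cm.sum(axis=1)    # true samples per class (row sums)
predicted = cm.sum(axis=0)  # predicted samples per class (column sums)

precision = tp / predicted
recall = tp / support
f1 = 2 * precision * recall / (precision + recall)

accuracy = tp.sum() / cm.sum()
macro_f1 = f1.mean()                           # every class counts equally
weighted_f1 = np.average(f1, weights=support)  # classes weighted by support

print(f'Per-class F1: {f1}')
print(f'Accuracy: {accuracy}')
print(f'Macro-f1: {macro_f1}')
print(f'Weighted-f1: {weighted_f1}')
```

The per-class F1 scores are 14/31, 1/12, and 4/25. Class 0 has both the largest support (16 samples) and the highest F1, which is why the weighted average comes out above the macro average.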


4. Flowers — k-NN Classification with CV
Task: Fit k-NN on sepal_length, sepal_width, petal_length, petal_width. Use 5-fold CV to choose k ∈ {1,3,…,25}. Report best k, CV score, test accuracy, and confusion matrix.

In [None]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split,cross_val_score

df=pd.read_csv('flowers.csv')

X=df.iloc[:,:-1]
y=df.iloc[:,-1]

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)

#why odd k values --> with two classes, an odd k cannot tie in majority voting (k=4 can split 2:2; k=3 gives 2:1 or 3:0)
#with three classes, as here, ties are still possible (e.g. k=3 can split 1:1:1), so odd k reduces ties but does not eliminate them
k=list(range(1,26,2))
cv_score=[]
for i in k:
  model=KNeighborsClassifier(n_neighbors=i)
  score=cross_val_score(model,X_train,y_train,cv=5)   #for each k, returns 5 accuracy values (one per fold, as cv=5)
  cv_score.append(score.mean())  #stores the mean accuracy for this k in the cv_score list

#(metric used here is accuracy) higher accuracy --> better model. So we take max of cv_score
best_k=k[np.argmax(cv_score)]   #retrieves the index of the max cv_score value to get the best k

final_model=KNeighborsClassifier(n_neighbors=best_k)   #creating a k-NN model using best_k
final_model.fit(X_train,y_train)

y_pred=final_model.predict(X_test)

print(f'Accuracy: {accuracy_score(y_test,y_pred)}')
print(f'Confusion matrix: {confusion_matrix(y_test,y_pred)}')

Accuracy: 0.225
Confusion matrix: [[5 6 7]
 [6 2 4]
 [3 5 2]]
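
The task also asks for the best k and its CV score, which the cell above computes but does not print. One way to report them, sketched below, is GridSearchCV over a pipeline, which also refits the scaler inside each fold instead of scaling the whole training set before cross-validation. The sketch uses scikit-learn's built-in iris data as a stand-in for flowers.csv, which is an assumption: the two share feature names, but the file's contents are not known here, so the scores it prints do not apply to the results above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): the iris dataset, not flowers.csv
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler sits inside the pipeline, so each CV fold is scaled
# using statistics from its own training portion only
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

param_grid = {'knn__n_neighbors': list(range(1, 26, 2))}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)  # refits the best pipeline on all training data

best_k = search.best_params_['knn__n_neighbors']
cv_accuracy = search.best_score_  # mean accuracy over the 5 folds for best_k

y_pred = search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f'Best k: {best_k}')
print(f'CV accuracy: {cv_accuracy}')
print(f'Test accuracy: {test_accuracy}')
print(f'Confusion matrix: {cm}')
```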


5. Airbnb Prices — k-NN Regression with CV
Task: Fit k-NN regressor on size_m2, distance_center_km, rating, num_reviews. Use 5-fold CV to pick k ∈ {1,3,…,25} (scaling required). Report CV RMSE, test RMSE, and test R².

In [None]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split,cross_val_score

df=pd.read_csv('airbnb.csv')

X=df.iloc[:,:-1]
y=df.iloc[:,-1]

X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.2,random_state=42)

scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)

k=list(range(1,26,2))
cv_score=[]
for i in k:
  model=KNeighborsRegressor(n_neighbors=i)
  score=cross_val_score(model,X_train,y_train,cv=5,scoring='neg_mean_squared_error')   #returns the negative MSE for each of the 5 folds
  cv_score.append(np.sqrt(-score.mean()))   #negates the mean to get a positive MSE, then takes the square root to get RMSE

#(metric used here is rmse) lower rmse --> better model. So we take min of cv_score
best_k=k[np.argmin(cv_score)]

final_model=KNeighborsRegressor(n_neighbors=best_k)
final_model.fit(X_train,y_train)

y_pred=final_model.predict(X_test)
print(f'CV RMSE: {min(cv_score)}')
print(f'Test RMSE: {np.sqrt(mean_squared_error(y_test,y_pred))}')
print(f'R²: {r2_score(y_test,y_pred)}')

CV RMSE: 143.92254850275012
Test RMSE: 113.13890983519951
R²: -0.1616135778243346
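
A negative R² means the model's squared error on the test set is larger than that of a constant prediction equal to the test mean, so a baseline comparison is a useful sanity check. The sketch below tunes k with the scaler inside the pipeline and compares the result with DummyRegressor, which always predicts the training mean. The data come from make_regression as a hypothetical stand-in for airbnb.csv (four features, as in the task), so the printed numbers are unrelated to the results above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for airbnb.csv: 4 numeric features, noisy target
X, y = make_regression(n_samples=200, n_features=4, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ('scale', StandardScaler()),  # refit inside every CV fold
    ('knn', KNeighborsRegressor()),
])
param_grid = {'knn__n_neighbors': list(range(1, 26, 2))}

# scikit-learn maximizes scores, so RMSE is exposed as its negative
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring='neg_root_mean_squared_error')
search.fit(X_train, y_train)

best_k = search.best_params_['knn__n_neighbors']
cv_rmse = -search.best_score_  # mean of the 5 per-fold RMSE values

y_pred = search.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
test_r2 = r2_score(y_test, y_pred)

# Baseline: always predict the mean of the training targets
baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
baseline_r2 = r2_score(y_test, baseline_pred)

print(f'Best k: {best_k}')
print(f'CV RMSE: {cv_rmse}')
print(f'Test RMSE: {test_rmse}, R²: {test_r2}')
print(f'Baseline RMSE: {baseline_rmse}, R²: {baseline_r2}')
```

This scoring averages the per-fold RMSE values, whereas the cell above takes the square root of the mean MSE; the two can differ slightly. The baseline R² is never above zero, so a tuned model with a test R² below it has learned nothing useful from the features.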
