# Lab 3: More supervised learning (linear models)

Your objectives for this lab are to:
* perform a regression task with `LinearRegression` and interpret its outputs,
* implement L1 (`Lasso`) and L2 (`Ridge`) regularization to understand how they affect coefficients in a linear regression model, and
* perform a classification task with `LogisticRegression`, interpret its outputs, and adjust the regularization strength (`C`).

First, make the necessary imports for today with the code below.

In [None]:
import numpy as np
import pandas as pd
import warnings


from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)

# Part 1: Regression

Your first task is to implement linear regression models to predict the median property value for different residential districts. Since property prices can, in principle, range from zero to infinity, this is a regression task: we want our models to output a continuous value, not a categorical class label.

For this exercise, we'll use the California housing dataset provided by `sklearn`. Take a couple minutes to familiarize yourself with the data here: https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset. Then, read in the dataset with the code below:

In [6]:
housing = fetch_california_housing(as_frame=True)
housing = housing.frame
housing.head()


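|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|-------------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |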


## Question 1

Now let's implement a standard `LinearRegression` model. First, split `housing` into a training set and a test set. Then, fit the model to the training set. Finally, print the model's score (the R-squared) on the training and test sets.

In [10]:

print(housing.isnull().sum())

MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64


In [11]:
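# Separate the features (every column except the target) from the target
cleandf = housing.drop(['MedHouseVal'], axis=1)

x = cleandf
y = housing['MedHouseVal']
# Hold out 30% of the rows for testing (no random_state, so the split changes on each run)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

Lreg = LinearRegression()

Lreg.fit(x_train, y_train)

# For regressors, .score() returns R-squared
print(Lreg.score(x_train, y_train))
print(Lreg.score(x_test, y_test))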



0.6038296826740983
0.6086510210510824
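
The score that `.score()` reports for a regressor is the coefficient of determination, R-squared. As a minimal sketch of what that number measures, the hypothetical helper `r_squared` below (not part of `sklearn`) recomputes it with NumPy on made-up values:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R-squared = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])             # made-up target values
perfect = r_squared(y_true, y_true)                 # predictions match exactly
baseline = r_squared(y_true, np.full(4, y_true.mean()))  # always predict the mean
print(perfect, baseline)
```

A model that always predicts the mean of `y` scores 0 and a perfect model scores 1, so scores near 0.6 suggest the model explains roughly 60% of the variance in median house value.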




## Question 2

Now that you have a `LinearRegression` model that's been fitted to the training data, use the code below to print the model coefficients. Then, in the text cell below, write out your interpretation of the coefficients (e.g., Which attribute is most strongly associated with property prices? What does that tell you?)


In [14]:
print(Lreg.coef_)




[ 4.46632963e-01  9.17640574e-03 -1.24596160e-01  6.51851876e-01
 -6.83475637e-06 -3.95157466e-03 -4.08478902e-01 -4.21780813e-01]


*...write your interpretation of the coefficients here!*

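Median income (`MedInc`) and average bedrooms (`AveBedrms`) have the largest positive coefficients, while `Latitude` and `Longitude` have similarly large negative ones; the `HouseAge` coefficient is positive but small (about 0.009). Because the features are measured on different scales, raw coefficient sizes are only a rough guide to how strongly each attribute is associated with price. As noted in the lecture, correlation does not equal causation.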
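
One caveat when ranking attributes by coefficient size is that a coefficient depends on the units of its feature. The sketch below illustrates this on synthetic data (the `income` and `population` arrays are made up, and `ols_coefs` is a hypothetical helper built on `np.linalg.lstsq`, not the `sklearn` estimator): the same fit is run on raw features and on standardized features.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
income = rng.normal(4.0, 2.0, n)            # synthetic feature on a small scale
population = rng.normal(1500.0, 1000.0, n)  # synthetic feature on a large scale
y = 0.5 * income + 0.0005 * population + rng.normal(0.0, 0.1, n)

def ols_coefs(X, y):
    """Ordinary least squares with an intercept; returns only the slopes."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta[1:]

X = np.column_stack([income, population])
raw = ols_coefs(X, y)

# Standardize each column to mean 0 and standard deviation 1, then refit
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
standardized = ols_coefs(X_std, y)

print("raw coefficients:         ", raw)
print("standardized coefficients:", standardized)
```

On raw features the `population` coefficient looks negligible only because its units are large; after standardizing, each coefficient is the change in `y` per one standard deviation of the feature, which makes magnitudes easier to compare.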


## Question 3

Now let's implement a `Ridge` regression model. Just like you did above, fit the model to the training data, then print the model's scores on the training and test set... how do the scores compare to the scores output by the `LinearRegression` model?

(*Hint: the scores should be about the same, unless you tinker with the `alpha` hyperparameter*)

In [15]:
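x=cleandf
y=housing['MedHouseVal']
# Note: this draws a new random split, so the scores are not computed on the same rows as for the LinearRegression model
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3)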

Rreg= Ridge()

Rreg.fit(x_train,y_train)

print(Rreg.score(x_train,y_train))
print(Rreg.score(x_test,y_test))



0.6069488918417678
0.6014616750717642
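
To see why `alpha` matters for ridge regression, here is a minimal NumPy sketch of the closed-form ridge solution on synthetic data. The helper `ridge_coefs` is hypothetical and assumes the usual objective (squared error plus `alpha` times the squared L2 norm of the coefficients, with an unpenalized intercept); it is an illustration, not `sklearn`'s solver.

```python
import numpy as np

def ridge_coefs(X, y, alpha):
    """Minimize ||y - Xw||^2 + alpha * ||w||^2 on centered data (intercept not penalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = X.shape[1]
    # Normal equations with alpha added to the diagonal
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # synthetic features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Size of the coefficient vector for increasingly strong penalties
norms = {alpha: np.linalg.norm(ridge_coefs(X, y, alpha))
         for alpha in (0.0, 1.0, 100.0, 10000.0)}
for alpha, norm in norms.items():
    print(f"alpha={alpha:>8}: ||w|| = {norm:.4f}")
```

The coefficient norm shrinks as `alpha` grows, but a small `alpha` barely changes the fit when there are many rows, which is consistent with the hint that the default `Ridge` scores should be about the same as the `LinearRegression` scores.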


## Question 4

Now implement a `Lasso` regression model. Once again, fit the model to the training data, then print the model's scores on the training and test set... how do the scores compare to the scores output by the `LinearRegression` model?

(*Hint: the scores should **not** be the same*)

In [16]:
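x=cleandf
y=housing['MedHouseVal']
# Note: this draws a new random split, so the scores are not computed on the same rows as for the earlier models
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3)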

Lareg= Lasso()

Lareg.fit(x_train,y_train)

print(Lareg.score(x_train,y_train))
print(Lareg.score(x_test,y_test))



0.284841052997165
0.28821144009175437
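
The L1 penalty behaves differently from L2 because it can set coefficients to exactly zero. Under the simplifying assumption of orthonormal features, the lasso solution is the least-squares solution passed through a soft-thresholding function. The sketch below shows that function on made-up numbers; `soft_threshold` is a hypothetical helper, and the threshold `1.0` is an arbitrary illustration rather than a value taken from the model above.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each value toward zero by t; values with |z| <= t become zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

unpenalized = np.array([3.0, -0.5, 0.2, -2.0])   # made-up least-squares coefficients
shrunk = soft_threshold(unpenalized, 1.0)

# Large entries move toward zero by the threshold; small entries are zeroed out
print(shrunk)
```

This is why a `Lasso` model with a strong enough penalty can drop attributes entirely, and why its score can fall well below the unregularized model's when useful attributes are removed.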


## Question 5

Inspect the coefficients from all three models side-by-side: Create and print a dataframe with all the coefficients such that there's a row for each attribute in the `housing` dataset, and a column for each of your models (linear regression, ridge, and lasso).

In [None]:
coefficients = {
    'Linear Regression': Lreg.coef_,
    'Ridge': Rreg.coef_,
    'Lasso': Lareg.coef_
}

coeff_df = pd.DataFrame(coefficients, index=cleandf.columns)
print(coeff_df)




            Linear Regression     Ridge     Lasso
MedInc               0.446633  0.445226  0.146006
HouseAge             0.009176  0.009452  0.005672
AveRooms            -0.124596 -0.120786  0.000000
AveBedrms            0.651852  0.775427 -0.000000
Population          -0.000007 -0.000007 -0.000010
AveOccup            -0.003952 -0.003874 -0.000000
Latitude            -0.408479 -0.420074 -0.000000
Longitude           -0.421781 -0.432640 -0.000000


## Question 6

Write down your interpretation of the coefficients. How do the coefficients differ across models? What can we infer from this?

*...write your interpretation of the coefficients here!*

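The coefficients differ because of regularization, which is applied in both the ridge and lasso models. Ridge uses the L2 penalty, which penalizes large coefficients and shrinks them toward zero without eliminating them; here its coefficients stay close to the linear regression ones. Lasso uses the L1 penalty, which effectively performs variable selection: it set the coefficients of `AveRooms`, `AveBedrms`, `AveOccup`, `Latitude`, and `Longitude` to 0 and kept only `MedInc`, `HouseAge`, and `Population`. Note that each model was fit on a different random train-test split, so some of the differences between columns also reflect the split.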

# Part 2: Classification

Your next task is to implement a logistic regression model to predict whether customers of a telecommunications company will churn. Since we want a model that outputs a discrete class label ("churn" or "no churn"), this is a classification task.

For this exercise, we'll use a dataset called `telco.csv`. Each row represents a customer and there are many attributes describing each customer (e.g., `tenure` records the number of months the customer has been with the company; `PaperlessBilling` records whether the customer has paperless billing or not). The target variable is `Churn`, where 0 indicates no churn and 1 indicates churn.

## Question 7

There are several different ways to import an external data file into Colab (see here: https://www.geeksforgeeks.org/ways-to-import-csv-files-in-google-colab/).

Perhaps the simplest way is to import the file manually with the following steps:
1. Download the data file (`telco.csv`) to your own device
2. Click the file icon on the left-side bar of this Colab window
3. Drag and drop the data file into the file menu to the left
4. Run the following code to read in the data file: `df = pd.read_csv("telco.csv")`

Now you try. Import the `telco.csv` data file with whatever method you prefer and define it as a pandas dataframe called `df`.

In [42]:

df = pd.read_csv("telco.csv")
df.head()
cleandf= df.drop(['Churn'], axis=1)


In [43]:
print(df.isnull().sum())

tenure                                   0
Partner                                  0
Dependents                               0
gender                                   0
PhoneService                             0
PaperlessBilling                         0
MonthlyCharges                           0
TotalCharges                             0
MultipleLines_No phone service           0
MultipleLines_Yes                        0
InternetService_Fiber optic              0
InternetService_No                       0
OnlineSecurity_No internet service       0
OnlineSecurity_Yes                       0
OnlineBackup_No internet service         0
OnlineBackup_Yes                         0
DeviceProtection_No internet service     0
DeviceProtection_Yes                     0
TechSupport_No internet service          0
TechSupport_Yes                          0
StreamingTV_No internet service          0
StreamingTV_Yes                          0
StreamingMovies_No internet service      0
StreamingMo

In [44]:
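x=cleandf
y=df['Churn']  # define the target here so that y refers to the telco data, not the housing target from Part 1
print(x.shape, y.shape)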

(7032, 29) (7032,)


Don't ask why I did this; it is witchcraft and I refuse to elaborate.

## Question 8

Define `X` and `y`, and then make a train-test split. Set `stratify=y` to ensure that the distribution of class labels present in all the data is reflected in both the training and test sets.

In [50]:

y=df['Churn']
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3, stratify=y)
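
To make the effect of `stratify=y` concrete, here is a rough sketch of the idea on a synthetic label array. The helper `stratified_split_indices` is hypothetical and only approximates what a stratified split does (it samples the test rows separately within each class); it is not `sklearn`'s implementation.

```python
import numpy as np

def stratified_split_indices(y, test_size, seed=0):
    """Pick test rows class by class so each class keeps its overall share."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    test_idx = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)   # rows belonging to this class
        rng.shuffle(idx)
        n_test = int(round(test_size * len(idx)))
        test_idx.extend(idx[:n_test])
    test_idx = np.sort(np.array(test_idx))
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return train_idx, test_idx

y_demo = np.array([0] * 80 + [1] * 20)     # synthetic labels: 20% positive class
train_idx, test_idx = stratified_split_indices(y_demo, test_size=0.3)

print("overall positive rate:", y_demo.mean())
print("train positive rate:  ", y_demo[train_idx].mean())
print("test positive rate:   ", y_demo[test_idx].mean())
```

With an imbalanced target such as churn, keeping the class proportions equal in both sets avoids a test set that happens to contain unusually few or many churners.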


## Question 9

Implement a `LogisticRegression` model to classify churn: fit the model to the training data, then print the model's scores on the training and test sets.

(*Hint: If you get a warning message about model convergence, try setting `max_iter=1000` when defining your `LogisticRegression`.*)

In [47]:
Loreg= LogisticRegression()

Loreg.fit(x_train,y_train)

print(Loreg.score(x_train,y_train))
print(Loreg.score(x_test,y_test))



0.8055668427468509
0.7985781990521327


## Question 10

Inspect the coefficients (sorted by coefficient values) and write down your interpretation of them... Which attribute is the strongest predictor of churn? What does the coefficient value tell you?

(*Hint: remember the log-odds scale?*)
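
As a reminder of what the log-odds scale implies, a logistic regression coefficient is the change in the log-odds of the positive class per one-unit increase in the feature, so exponentiating it gives an odds ratio. The sketch below uses the largest and smallest values from the `Loreg.coef_` output for this question; the helpers `odds_ratio` and `sigmoid` are hypothetical names, and the raw array does not say which attribute each value belongs to.

```python
import math

def odds_ratio(coef):
    """Multiplicative change in the odds of churn per one-unit increase in the feature."""
    return math.exp(coef)

def sigmoid(log_odds):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

largest = 3.85777877e-01     # largest coefficient in the printed array
smallest = -4.76278125e-01   # most negative coefficient in the printed array

print(round(odds_ratio(largest), 2))   # about 1.47: odds of churn rise by roughly 47%
print(round(odds_ratio(smallest), 2))  # about 0.62: odds of churn fall by roughly 38%
print(sigmoid(0.0))                    # log-odds of 0 correspond to a probability of 0.5
```

Because the features are not standardized here, a one-unit change means different things for different attributes (one month of `tenure` versus switching a 0/1 indicator), which is worth keeping in mind when comparing coefficients.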

In [48]:

print(Loreg.coef_)


[[-7.81432322e-02 -9.59762700e-02 -2.66665089e-01  2.37067624e-02
  -2.45501037e-01  3.36740539e-01  6.91307589e-03  4.41243842e-04
   1.10862015e-01  2.63128924e-01  3.75717490e-01 -1.21621524e-01
  -1.21621524e-01 -4.76278125e-01 -1.21621524e-01 -1.26196592e-01
  -1.21621524e-01 -1.36154436e-01 -1.21621524e-01 -4.57065568e-01
  -1.21621524e-01  4.61277156e-02 -1.21621524e-01  2.39546192e-02
  -2.24468153e-01 -3.48078374e-01 -1.58441786e-01  3.85777877e-01
  -2.63445363e-01]]
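
The raw array is hard to read without attribute names. One way to inspect it, sketched below, is to pair each coefficient with its column name and sort. The six name/value pairs are an illustration only: they assume the coefficients follow the column order shown by `df.isnull().sum()`, and the values are rounded from the output above. In the notebook itself, the names would presumably come from `x_train.columns` and the values from `Loreg.coef_[0]`.

```python
# Illustrative subset; the pairing with column names is an assumption (see text)
feature_names = ["tenure", "Partner", "Dependents", "gender",
                 "PhoneService", "PaperlessBilling"]
coefs = [-0.0781, -0.0960, -0.2667, 0.0237, -0.2455, 0.3367]

# Sort from most negative (lowers the log-odds of churn) to most positive
ranked = sorted(zip(feature_names, coefs), key=lambda pair: pair[1])

for name, coef in ranked:
    print(f"{name:>18}: {coef:+.4f}")
```

Sorting this way puts the attributes most associated with staying at the top and those most associated with churning at the bottom.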


*...write your interpretation of the coefficients here!*


## Question 11

Just like a linear regression model can be regularized with the L1 (`Lasso`) and L2 (`Ridge`) penalties, so too can logistic regression. But unlike with `LinearRegression`, regularizing `LogisticRegression` just involves adjusting its hyperparameters — namely, `C` (the inverse regularization strength) and `penalty` (which penalty term to apply).
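
To illustrate what "inverse regularization strength" means, the sketch below fits an L2-penalized logistic regression by plain gradient descent on synthetic data, for a small and a large `C`. The function `fit_logreg_l2` is a hypothetical teaching aid, not `sklearn`'s solver; it assumes an objective in which the penalty is weighted by `1 / C` relative to the log-loss, with an unpenalized intercept.

```python
import numpy as np

def fit_logreg_l2(X, y, C, lr=0.1, n_iter=2000):
    """Gradient descent on mean log-loss + ||w||^2 / (2 * C * n)."""
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad_w = X.T @ (prob - y) / n + w / (C * n)  # log-loss gradient + penalty gradient
        grad_b = np.mean(prob - y)                   # intercept is not penalized
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                        # synthetic features
true_logits = 2.0 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-true_logits))).astype(float)

w_strong, _ = fit_logreg_l2(X, y, C=0.01)   # small C: strong regularization
w_weak, _ = fit_logreg_l2(X, y, C=100.0)    # large C: weak regularization

print("C=0.01:", w_strong)
print("C=100: ", w_weak)
```

Smaller values of `C` pull the coefficients toward zero, while larger values let the model fit the training data more freely.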

Write a loop to test many different values for the `C` hyperparameter for `LogisticRegression`. Create a list of results, with the training score, test score, and `C` value. Print the list, sorted by test score.


In [53]:
C_values = [0.01, 0.1, 1, 10, 100]
results = []
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3, stratify=y)

for C in C_values:
    Loreg = LogisticRegression(C=C, max_iter=1000).fit(x_train, y_train)
    train_score = Loreg.score(x_train, y_train)
    test_score = Loreg.score(x_test, y_test)
    results.append((C, train_score, test_score))

# Sort by test score
sorted_results = sorted(results, key=lambda x: x[2], reverse=True)
print("C values sorted by test score:")
for res in sorted_results:
    print(f"C: {res[0]}, Train Score: {res[1]:.4f}, Test Score: {res[2]:.4f}")



C values sorted by test score:
C: 1, Train Score: 0.8039, Test Score: 0.8062
C: 100, Train Score: 0.8039, Test Score: 0.8062
C: 0.1, Train Score: 0.8066, Test Score: 0.8052
C: 10, Train Score: 0.8043, Test Score: 0.8047
C: 0.01, Train Score: 0.8015, Test Score: 0.8033
