# Wassim Mecheri - Lab 3

# Classification

# 0 Imports and loading the Dataset

## 0.1 Imports

We import the needed libraries:
- Pandas for data manipulation and analysis
- NumPy to convert our DataFrames to arrays
- train_test_split to create our test and train DataFrames
- LogisticRegression, which is our model of choice
- accuracy_score, precision_score, recall_score, and confusion_matrix to calculate the metrics

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

## 0.2 Load the Dataset

We read our .csv file and create a DataFrame from it.

In [2]:
df = pd.read_csv('survey lung cancer.csv')
df.head()

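|   | GENDER | AGE | SMOKING | YELLOW_FINGERS | ANXIETY | PEER_PRESSURE | CHRONIC DISEASE | FATIGUE | ALLERGY | WHEEZING | ALCOHOL CONSUMING | COUGHING | SHORTNESS OF BREATH | SWALLOWING DIFFICULTY | CHEST PAIN | LUNG_CANCER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 69 | 1 | 2 | 2 | 1 | 1 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | YES |
| 1 | M | 74 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | YES |
| 2 | F | 59 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | NO |
| 3 | M | 63 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2 | 2 | NO |
| 4 | F | 63 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 1 | NO |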


# 1 Preprocessing Data

## 1.1 Converting non-numerical values to numerical values

To work with this DataFrame and perform logistic regression, we need to convert non-numerical values to numerical values so the model can interpret all the data. In our case, GENDER and LUNG_CANCER need to be converted.

In [3]:
df.loc[df['GENDER']=='M','GENDER']=0
df.loc[df['GENDER']=='F','GENDER']=1
df.loc[df['LUNG_CANCER']=='YES','LUNG_CANCER']=1
df.loc[df['LUNG_CANCER']=='NO','LUNG_CANCER']=0
df.head()

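|   | GENDER | AGE | SMOKING | YELLOW_FINGERS | ANXIETY | PEER_PRESSURE | CHRONIC DISEASE | FATIGUE | ALLERGY | WHEEZING | ALCOHOL CONSUMING | COUGHING | SHORTNESS OF BREATH | SWALLOWING DIFFICULTY | CHEST PAIN | LUNG_CANCER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 69 | 1 | 2 | 2 | 1 | 1 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 1 |
| 1 | 0 | 74 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 |
| 2 | 1 | 59 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | 0 |
| 3 | 0 | 63 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2 | 2 | 0 |
| 4 | 1 | 63 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 1 | 0 |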
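
Depending on the pandas version, the `.loc` assignments above can leave GENDER and LUNG_CANCER with a non-numeric (object) dtype even though every value is now a number. A possible alternative, sketched here on a few columns of the five rows shown by `df.head()` (the `sample` DataFrame is only an illustration, not the full dataset), is to use `map` followed by an explicit cast:

```python
import io
import pandas as pd

# Stand-in data: three columns plus the target, taken from the df.head() rows above.
sample_csv = io.StringIO(
    "GENDER,AGE,SMOKING,LUNG_CANCER\n"
    "M,69,1,YES\n"
    "M,74,2,YES\n"
    "F,59,1,NO\n"
    "M,63,2,NO\n"
    "F,63,1,NO\n"
)
sample = pd.read_csv(sample_csv)

# Map each label to its code, then cast to int so the column dtype is numeric.
sample["GENDER"] = sample["GENDER"].map({"M": 0, "F": 1}).astype(int)
sample["LUNG_CANCER"] = sample["LUNG_CANCER"].map({"YES": 1, "NO": 0}).astype(int)

print(sample.dtypes)
print(sample)
```

A side benefit of `map` is that any unexpected label becomes NaN, so the cast to int fails loudly instead of silently leaving a string in the column.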


## 1.2 Define features and target

Now that all our values are numerical, we can define our features and target. The goal of this lab is to predict the likelihood of lung cancer, so our target is LUNG_CANCER, and the rest are the features.

In [4]:
y = np.array(df['LUNG_CANCER'], dtype=int)
X = np.array(df.drop(columns='LUNG_CANCER'))

## 1.3 Creating train and test sets

We then split the feature and target arrays into a train set (80%) and a test set (20%) for the next step.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 
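
When one class is much rarer than the other, a purely random split can leave very few examples of the rare class in the test set. The sketch below uses synthetic data (`X_demo` and `y_demo` are made up for the illustration, not the lab's data) to show the `stratify` parameter of `train_test_split`, which keeps the class proportions the same in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in: 90 positives and 10 negatives.
rng = np.random.default_rng(42)
X_demo = rng.integers(1, 3, size=(100, 4))
y_demo = np.array([1] * 90 + [0] * 10)

# stratify=y_demo preserves the 90/10 class ratio in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("negatives in train:", int((y_tr == 0).sum()))
print("negatives in test:", int((y_te == 0).sum()))
```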

# 2 Fitting the Model

## 2.1 Creating the model

First, we create our model, which is a logistic regression.

In [6]:
model = LogisticRegression()

## 2.2 Training the model with train data

Then we train the model using the train data.

In [7]:
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Notes

This warning means that the solver reached its iteration limit before converging, so the fitted coefficients may not be the best solution. As the message suggests, we can either raise the limit with the max_iter parameter of LogisticRegression or scale the data before fitting.
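
Both remedies named in the warning can be sketched as follows. The data here is synthetic (one wide-ranged column standing in for AGE and three columns coded 1/2), and `max_iter=1000` is an arbitrary choice rather than a tuned value:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab's features and target.
rng = np.random.default_rng(0)
age = rng.integers(30, 80, size=200)
flags = rng.integers(1, 3, size=(200, 3))
X_demo = np.column_stack([age, flags])
y_demo = (flags.sum(axis=1) + rng.normal(0, 1, size=200) > 4.5).astype(int)

# Option 1: give the solver more iterations.
model_more_iter = LogisticRegression(max_iter=1000)
model_more_iter.fit(X_demo, y_demo)

# Option 2: standardize the features, then fit with the default settings.
model_scaled = make_pipeline(StandardScaler(), LogisticRegression())
model_scaled.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used.
print("iterations without scaling:", model_more_iter.n_iter_)
print("iterations with scaling:", model_scaled[-1].n_iter_)
```

Putting the scaler in a pipeline means it is fitted on the training data only and then reused unchanged when predicting on the test data.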

## 2.3 Model’s parameters

We can display the model’s parameters. Each coefficient is the weight of one feature in the log-odds of lung cancer: with a positive coefficient, a larger feature value increases the predicted probability of lung cancer, and with a negative coefficient it decreases it. The intercept is the bias, that is, the log-odds when all the features are equal to 0; in our case, it’s around -15, which corresponds to a probability very close to 0. This all-zero case lies outside the data, since the survey answers shown above are coded 1 or 2, so the intercept should not be read as a realistic baseline risk.

In [8]:
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

Coefficients: [[ 8.04689769e-02 -3.97381500e-04  6.82301313e-01  1.16950739e+00
  -1.78961286e-01  7.52775900e-01  1.65184104e+00  1.74133577e+00
   1.03368086e+00  4.09350939e-01  1.25511782e+00  1.39542104e+00
  -1.44173730e-01  1.71776321e+00  5.58904790e-01]]
Intercept: [-15.65398895]
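
To make these numbers more concrete, the sketch below converts the printed intercept from log-odds to a probability with the logistic (sigmoid) function, and turns one of the printed coefficients into an odds ratio. The helper `sigmoid` and the variable names are introduced here for illustration only:

```python
import numpy as np

def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Values copied from the output above.
intercept = -15.65398895
coef_example = 1.74133577  # one of the larger coefficients printed above

# Probability predicted when every feature equals 0.
p_all_zero = sigmoid(intercept)
print(f"P(cancer | all features = 0) = {p_all_zero:.2e}")

# exp(coefficient) is the factor applied to the odds when that feature
# increases by one unit, all other features being held fixed.
odds_ratio = np.exp(coef_example)
print(f"Odds ratio for that coefficient: {odds_ratio:.2f}")
```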


## 2.4 Try prediction on test data

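Once the model is trained, we generate predictions on the test data: predict_proba gives the predicted probability of the positive class (lung cancer), and predict gives the predicted class labels.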

In [9]:
y_pred_prob = model.predict_proba(X_test)[:, 1]
y_pred = model.predict(X_test)
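
For a binary logistic regression, the labels returned by `predict` correspond to a 0.5 cut-off on the positive-class probability. The sketch below checks this on synthetic data and shows how a different cut-off could be applied to the probabilities; the value 0.8 is a hypothetical choice for the illustration, not a recommendation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with a positive-majority target.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > -1.0).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)[:, 1]

# Default behaviour: class 1 when the probability exceeds 0.5.
default_pred = clf.predict(X_demo)
manual_pred = (proba > 0.5).astype(int)
print("same as predict():", np.array_equal(default_pred, manual_pred))

# Stricter cut-off: fewer samples are flagged as positive.
strict_pred = (proba >= 0.8).astype(int)
print("positives at 0.5:", int(default_pred.sum()))
print("positives at 0.8:", int(strict_pred.sum()))
```

Raising the cut-off trades recall for specificity, which is one way to act on the probabilities instead of only on the default labels.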

# 3 Calculating Metrics on Test Data

Finally, to quantify the model's performance, we calculate the metrics.

In [10]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.2f}")

recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.2f}")

conf_matrix = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = conf_matrix.ravel()
specificity = tn / (tn + fp)
print(f"Specificity: {specificity:.2f}")

Accuracy: 0.98
Precision: 0.98
Recall: 1.00
Specificity: 0.50
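
Specificity is the recall of the negative class, so it can also be obtained from `recall_score` with `pos_label=0`. The sketch below uses small hypothetical label arrays (8 positives, 2 negatives, one negative misclassified) to show both computations, along with `balanced_accuracy_score`, which averages recall and specificity:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
    recall_score,
)

# Hypothetical labels for illustration, not the lab's test set.
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
y_hat = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

# labels=[0, 1] fixes the matrix layout to [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
specificity_manual = tn / (tn + fp)

# Same quantity, computed as the recall of class 0.
specificity_sklearn = recall_score(y_true, y_hat, pos_label=0)

print(f"Accuracy: {accuracy_score(y_true, y_hat):.2f}")
print(f"Specificity (manual): {specificity_manual:.2f}")
print(f"Specificity (recall of class 0): {specificity_sklearn:.2f}")
print(f"Balanced accuracy: {balanced_accuracy_score(y_true, y_hat):.2f}")
```

With so few negatives, a single misclassified case moves the specificity by 50 points while the accuracy barely changes.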


# 4 Conclusion

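The results look good at first glance, with an accuracy and precision of 98%. The recall is 100%, which means the model correctly identified every actual lung cancer case in the test set. On the other hand, the specificity is only 50%: half of the non-cancer cases were wrongly flagged as cancer. An accuracy of 98% alongside such a low specificity is only possible if non-cancer cases make up a small fraction of the test set, so accuracy alone gives an overly optimistic picture, and the specificity itself rests on very few cases.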