# Your Very First Machine Learning (ML) Model: Logistic Regression

Dataset: [College Student Placement Factors Dataset](https://www.kaggle.com/datasets/sahilislam007/college-student-placement-factors-dataset) (`data/college_student_placement_dataset.csv`)

In [4]:
pip install statsmodels

Successfully installed patsy-1.0.2 statsmodels-0.14.5

In [5]:
# Import our libraries.

from IPython.display import display

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, f1_score

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import statsmodels.api as sm

## Constants

It is good programming practice to use constants. _Centralizing_ semantically identical values avoids repetition errors and saves you the effort of retyping the same expression.

`DATASET_PATH` identifies the path to the dataset being loaded and operated on. `RANDOM_STATE` makes otherwise random operations reproducible run after run. Keep whatever value you set it to unless you want slightly different results.

In [9]:
DATASET_PATH = "../data/college_student_placement_dataset.csv"
RANDOM_STATE = 45  # DO NOT CHANGE THIS RANDOM STATE.

## Preliminary Inspection

See what the raw file looks like!

In [10]:
# Look at the first 5 lines of the raw contents of the file first.

with open(DATASET_PATH, "r") as file:
    for _ in range(5):
        if line := file.readline():
            print(line, end="")  # Each line already ends with a newline.
        else:
            break  # Stop; there are fewer than 5 lines.

College_ID,IQ,Prev_Sem_Result,CGPA,Academic_Performance,Internship_Experience,Extra_Curricular_Score,Communication_Skills,Projects_Completed,Placement
CLG0030,107,6.61,6.28,8,No,8,8,4,No
CLG0061,97,5.52,5.37,8,No,7,8,0,No
CLG0036,109,5.36,5.83,9,No,3,1,1,No
CLG0055,122,5.47,5.75,6,Yes,1,6,1,No



## Preliminary Load

Load the data. This is not the final form of the data which will be used, but it’s a `DataFrame` for further inspection so we can decide what to do with it next.

In [11]:
# Load the dataset into a pandas DataFrame.

df = pd.read_csv(DATASET_PATH)
df.head()

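|   | College_ID | IQ | Prev_Sem_Result | CGPA | Academic_Performance | Internship_Experience | Extra_Curricular_Score | Communication_Skills | Projects_Completed | Placement |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CLG0030 | 107 | 6.61 | 6.28 | 8 | No | 8 | 8 | 4 | No |
| 1 | CLG0061 | 97 | 5.52 | 5.37 | 8 | No | 7 | 8 | 0 | No |
| 2 | CLG0036 | 109 | 5.36 | 5.83 | 9 | No | 3 | 1 | 1 | No |
| 3 | CLG0055 | 122 | 5.47 | 5.75 | 6 | Yes | 1 | 6 | 1 | No |
| 4 | CLG0004 | 96 | 7.91 | 7.69 | 7 | No | 8 | 10 | 2 | No |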


## Null Values Check
Inspect which variables may or may not be suitable as features based on their null values.


In [14]:
# Identify which columns have null values.
df.isnull().sum()


College_ID                0
IQ                        0
Prev_Sem_Result           0
CGPA                      0
Academic_Performance      0
Internship_Experience     0
Extra_Curricular_Score    0
Communication_Skills      0
Projects_Completed        0
Placement                 0
dtype: int64

## Duplicate Rows Check
Check whether the data contains any duplicate rows. If so, remove the duplicates.

In [16]:
# Check to see if our data has any duplicate rows.
duplicate_rows = df.duplicated().sum()
print(duplicate_rows)

0


Many clean… 😐
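This dataset has no duplicates, so nothing needs removing here. For reference, a minimal sketch of what removal could look like on a toy frame (the values below are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one.
toy = pd.DataFrame({
    "IQ": [107, 97, 107],
    "Placement": ["No", "No", "No"],
})

# duplicated() marks rows that are identical to an earlier row.
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row by default.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated.shape)
```

Resetting the index is optional; it simply avoids gaps in the row labels after rows are dropped.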

## Categorical Categories

What are the categories for the categorical-looking (i.e., non-numeric) columns?

In [18]:
categorical_cols = df.select_dtypes(include=['object']).columns

print(f"Categorical columns: {list(categorical_cols)}")

Categorical columns: ['College_ID', 'Internship_Experience', 'Placement']


## Feature Engineering

Non-numeric columns containing `'Yes'` and `'No'` do not work with logistic regression. Binary categories can be converted to an integer type (`int`) with a value of 0 or 1.

In [21]:

pd.get_dummies(df, columns=["Internship_Experience", "Placement"], drop_first=True)


|   | College_ID | IQ | Prev_Sem_Result | CGPA | Academic_Performance | Extra_Curricular_Score | Communication_Skills | Projects_Completed | Internship_Experience_1 | Placement_1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CLG0030 | 107 | 6.61 | 6.28 | 8 | 8 | 8 | 4 | False | False |
| 1 | CLG0061 | 97 | 5.52 | 5.37 | 8 | 7 | 8 | 0 | False | False |
| 2 | CLG0036 | 109 | 5.36 | 5.83 | 9 | 3 | 1 | 1 | False | False |
| 3 | CLG0055 | 122 | 5.47 | 5.75 | 6 | 1 | 6 | 1 | True | False |
| 4 | CLG0004 | 96 | 7.91 | 7.69 | 7 | 8 | 10 | 2 | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | CLG0021 | 119 | 8.41 | 8.29 | 4 | 1 | 8 | 0 | False | True |
| 9996 | CLG0098 | 70 | 9.25 | 9.34 | 7 | 0 | 7 | 2 | False | False |
| 9997 | CLG0066 | 89 | 6.08 | 6.25 | 3 | 3 | 9 | 5 | True | False |
| 9998 | CLG0045 | 107 | 8.77 | 8.92 | 3 | 7 | 5 | 1 | False | False |


❔ When should you use this versus using `pd.get_dummies`?
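To make the comparison concrete, here is a small sketch of two ways to encode a Yes/No column. It assumes the alternative to `pd.get_dummies` is an explicit `'No'`/`'Yes'` to 0/1 mapping, which the notebook does not state outright; the toy values are invented.

```python
import pandas as pd

# Hypothetical toy data with the same two Yes/No columns as the dataset.
toy = pd.DataFrame({
    "Internship_Experience": ["No", "Yes", "No"],
    "Placement": ["No", "Yes", "Yes"],
})
binary_columns = ["Internship_Experience", "Placement"]

# Option 1: explicit mapping. Column names stay the same and values become ints.
mapped = toy.copy()
for column in binary_columns:
    mapped[column] = mapped[column].map({"No": 0, "Yes": 1})

# Option 2: one-hot encoding. drop_first=True drops the "No" indicator,
# leaving one "<column>_Yes" indicator per column; dtype=int gives 0/1.
dummies = pd.get_dummies(toy, columns=binary_columns, drop_first=True, dtype=int)

print(mapped)
print(dummies)
```

For a strictly binary column both options carry the same information. An explicit mapping keeps the original column names and makes the meaning of 1 obvious, while `pd.get_dummies` generalizes to columns with more than two categories.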

## Visualization with `sns.pairplot`

In [None]:
# Use sns.pairplot to visualize.


## Feature Selection 

Choose the columns corresponding to the features _IQ_ and _internship experience_ to be your `X`. Use _placement_ as your target, `y`.

In [None]:
# Set X to the desired features.


# Set y to be our target variable.
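A minimal sketch of the selection step on a toy frame, assuming the Yes/No columns have already been converted to 0/1 under their original names. `FEATURES` and `TARGET` are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical, already-encoded toy data.
toy = pd.DataFrame({
    "IQ": [107, 97, 109, 122],
    "CGPA": [6.28, 5.37, 5.83, 5.75],
    "Internship_Experience": [0, 0, 0, 1],
    "Placement": [0, 0, 0, 1],
})

FEATURES = ["IQ", "Internship_Experience"]  # illustrative constant
TARGET = "Placement"                        # illustrative constant

X = toy[FEATURES]  # list of column names -> 2-D DataFrame
y = toy[TARGET]    # single column name -> 1-D Series

print(X.shape, y.shape)
```

Indexing with a list keeps `X` two-dimensional, which is the shape scikit-learn estimators expect for features.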

## Split to Testing and Training Datasets 

In [None]:
# Split our data into testing and training pairs.


# Print the number of rows and columns of our training and testing data.
print("X_train: %d rows, %d columns" % X_train.shape)
print("X_test: %d rows, %d columns" % X_test.shape)
print("y_train: %d rows, 1 column" % y_train.shape)
print("y_test: %d rows, 1 column" % y_test.shape)
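One possible way to perform the split is sketched below on synthetic data. The 80/20 ratio (`test_size=0.2`) and the use of `stratify` are assumptions made for this example; the notebook only asks that the same random state be reused.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

RANDOM_STATE = 45  # same value as the notebook's constant

# Hypothetical stand-in features and a balanced 0/1 target.
rng = np.random.default_rng(RANDOM_STATE)
X = pd.DataFrame({
    "IQ": rng.normal(100, 15, size=100).round(),
    "Internship_Experience": rng.integers(0, 2, size=100),
})
y = pd.Series(np.tile([0, 1], 50), name="Placement")

# stratify=y keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Fixing `random_state` means the same rows land in the same split on every run, which makes scores comparable between experiments.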

## Build and train your model

Initialize an untrained logistic regression model, and then fit it to your training data.

In [None]:
# Initialize our logistic regression model.
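The initialize-then-fit pattern can be sketched as follows, using synthetic training data in place of the real split. The `max_iter=1000` setting is an assumption added here because the features are left unscaled, which can slow the default solver's convergence.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

RANDOM_STATE = 45  # same value as the notebook's constant

# Hypothetical training data: placement is more likely with a higher IQ
# and with internship experience.
rng = np.random.default_rng(RANDOM_STATE)
iq = rng.normal(100, 15, size=200)
internship = rng.integers(0, 2, size=200)
X_train = np.column_stack([iq, internship])
y_train = (iq + 10 * internship + rng.normal(0, 5, size=200) > 105).astype(int)

# Create the untrained estimator, then learn coefficients from the data.
model = LogisticRegression(max_iter=1000, random_state=RANDOM_STATE)
model.fit(X_train, y_train)

print(model.coef_, model.intercept_)
```

After fitting, `model.coef_` holds one coefficient per feature and `model.intercept_` holds the bias term.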

## Evaluation

Make predictions with your test data and save the predictions as `y_pred`.

In [None]:
# 1. Make predictions of your test data and save them as `y_pred`.


y_pred

Calculate and print the accuracy, precision, recall, and F1 scores of your model.

In [None]:
# 2. Calculate and print the accuracy, precision, recall, and F1 scores of your model.

print("Accuracy Score: %f" % )
print("Precision Score: %f" % )
print("Recall Score: %f" % )
print('F1 Score: %f' % )
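For orientation, this is how the four metric functions imported at the top are typically called. The labels below are hand-written toy values, not predictions from the placement model.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true labels and predictions (1 = placed, 0 = not placed).
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Every function takes the true labels first and the predictions second.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy Score: %f" % accuracy)
print("Precision Score: %f" % precision)
print("Recall Score: %f" % recall)
print("F1 Score: %f" % f1)
```

In this toy case there are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 3/4. Accuracy alone can look good on imbalanced data even when recall is poor, which is why all four are worth printing.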

Plot a confusion matrix of your predicted results.

In [None]:
# 3. Plot a confusion matrix of your predicted results.
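One common way to draw the matrix is a seaborn heatmap over the output of `confusion_matrix`, sketched here with hand-written toy labels. The color map and the "No"/"Yes" tick labels are presentation choices assumed for this example.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = placed, 0 = not placed).
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
matrix = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots()
sns.heatmap(
    matrix,
    annot=True,   # write the count inside each cell
    fmt="d",      # format counts as integers
    cmap="Blues",
    xticklabels=["No", "Yes"],
    yticklabels=["No", "Yes"],
    ax=ax,
)
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
```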

How many true positives and true negatives did your model get?

In [None]:
# How many true positives and true negatives did your model get?

true_negatives, false_positives, false_negatives, true_positives = ?
print('True Negatives: %d' % true_negatives)
print('True Positives: %d' % true_positives)
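For a binary problem, the four counts can be unpacked by flattening the 2x2 confusion matrix. The sketch below uses made-up labels to show the order in which scikit-learn returns them.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = placed, 0 = not placed).
y_test = [0, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0]

# ravel() flattens [[TN, FP], [FN, TP]] row by row.
true_negatives, false_positives, false_negatives, true_positives = (
    confusion_matrix(y_test, y_pred).ravel()
)

print("True Negatives: %d" % true_negatives)
print("True Positives: %d" % true_positives)
```

The ordering follows from the sorted class labels: row 0 is the actual negative class and column 0 is the predicted negative class.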

Such awful 😞

## What Is the Most Important Feature?

Use `statsmodels` to create a summary report. Interpret the results.

In [None]:
# Add a constant term to the independent variables.


# Fit the model.


# Print the summary and interpret the results.

## Extra Credit: Use your brain and make a better model (as in better scores).



In [None]:
# Define the new X variable, and reuse the same y variable from before.


# Split our data into testing and training. Remember to use the same random state as you used before.


# Initialize our model.


# Fit-train our model using our training data.


# Make new predictions using our testing data.


# Print each of our scores to inspect performance.


# Plot the confusion matrix.