# Your Very First Machine Learning (ML) Model: Logistic Regression

Dataset: [College Student Placement Factors Dataset](https://www.kaggle.com/datasets/sahilislam007/college-student-placement-factors-dataset) (`data/college_student_placement_dataset.csv`)

In [None]:
# Import our libraries.

from IPython.display import display

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, f1_score

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import statsmodels.api as sm

## Constants

It is good programming practice to define constants for values that are used in more than one place. _Centralizing_ semantically identical values avoids typos from retyping them and means a change only has to be made once.

`DATASET_PATH` identifies the path to the dataset being loaded and operated on. `RANDOM_STATE` makes otherwise random operations reproducible run after run. Keep whatever value you set it to unless you want slightly different results.

In [None]:
DATASET_PATH = "data/college_student_placement_dataset.csv"
RANDOM_STATE = 45  # DO NOT CHANGE THIS RANDOM STATE.

## Preliminary Inspection

See what the raw file looks like!

In [None]:
# Look at the first 5 lines of the raw contents of the file first.

with open(DATASET_PATH, "r") as file:
    for line_number in range(5):
        if line := file.readline():
            print(line)
        else:
            break  # Stop; there are fewer than 5 lines.

## Preliminary Load

Load the data. This is not the final form of the data which will be used, but it’s a `DataFrame` for further inspection so we can decide what to do with it next.
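
As a minimal sketch of the loading step, the example below reads CSV text into a `DataFrame` with `pd.read_csv`. The in-memory `csv_text` and its column names are hypothetical stand-ins, not the contents of the real file; with the real dataset you would pass `DATASET_PATH` instead of the `io.StringIO` object.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file. The column names are
# illustrative assumptions, not taken from the dataset.
csv_text = """Student_ID,IQ,Internship_Experience,Placement
S001,104,Yes,Yes
S002,97,No,No
S003,112,Yes,Yes
"""

# pd.read_csv accepts a file path or any file-like object.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)   # (rows, columns)
print(toy_df.dtypes)  # inferred type of each column
```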

In [None]:
# Load the dataset into a pandas dataframe.

df = ?
df

## Null Values Check
Inspect which variables may or may not be good to use as features based on null values.
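
One possible way to run this check is to count nulls per column with `isnull().sum()`. The sketch below uses a small made-up `toy_df` (its columns and values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Made-up data with a few missing values, for illustration only.
toy_df = pd.DataFrame({
    "IQ": [104, np.nan, 112, 97],
    "CGPA": [8.1, 7.4, np.nan, np.nan],
    "Placement": ["Yes", "No", "Yes", "No"],
})

# Count the null values in each column.
null_counts = toy_df.isnull().sum()
print(null_counts)

# Keep only the names of columns that contain at least one null.
columns_with_nulls = null_counts[null_counts > 0].index.tolist()
print(columns_with_nulls)
```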


In [None]:
# Identify which columns have null values.

## Duplicate Rows Check
Check whether the data contains duplicate rows. If so, remove the duplicates.
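
A minimal sketch of the duplicate check, assuming a made-up `toy_df`: `duplicated()` flags every repeat of an earlier row, and `drop_duplicates()` keeps the first occurrence of each.

```python
import pandas as pd

# Made-up data: the third row repeats the first one.
toy_df = pd.DataFrame({
    "IQ": [104, 97, 104, 112],
    "Placement": ["Yes", "No", "Yes", "Yes"],
})

# Count rows that are exact copies of an earlier row.
duplicate_count = int(toy_df.duplicated().sum())
print("Duplicate rows: %d" % duplicate_count)

# Drop the repeats and renumber the index.
deduplicated = toy_df.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```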

In [None]:
# Check to see if our data has any duplicate rows.

Many clean… 😐

## Categorical Categories

What are the categories for the categorical-looking (i.e., non-numeric) columns?
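
One way to answer this, sketched on a made-up `toy_df` with assumed column names, is to select the non-numeric columns and list the unique values of each.

```python
import pandas as pd

# Made-up data; the column names are illustrative assumptions.
toy_df = pd.DataFrame({
    "IQ": [104, 97, 112],
    "Internship_Experience": ["Yes", "No", "Yes"],
    "Placement": ["Yes", "No", "No"],
})

# Select every column that is not numeric.
non_numeric_columns = toy_df.select_dtypes(exclude="number").columns

# Show the distinct categories found in each of those columns.
for column in non_numeric_columns:
    print(column, toy_df[column].unique())
```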

## Feature Engineering

Non-numeric columns containing `'Yes'` and `'No'` do not work with logistic regression, which expects numeric input. Binary categories can be converted to an integer type (`int`) with a value of 0 or 1.
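
A minimal sketch of the conversion, assuming a made-up column named `Internship_Experience`: `Series.map` with a dictionary turns each label into 0 or 1. Any label missing from the dictionary would become `NaN`, so the final `astype(int)` also acts as a check that every value was mapped.

```python
import pandas as pd

# Made-up data; the column name is an illustrative assumption.
toy_df = pd.DataFrame({"Internship_Experience": ["Yes", "No", "Yes"]})

# Map each label to an integer. An unmapped label would turn into NaN,
# and astype(int) would then raise instead of silently passing it through.
yes_no_to_int = {"Yes": 1, "No": 0}
toy_df["Internship_Experience"] = (
    toy_df["Internship_Experience"].map(yes_no_to_int).astype(int)
)

print(toy_df)
```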

In [None]:
df

❔ When should you use this versus using `pd.get_dummies`?
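
To help explore that question, here is a sketch of what `pd.get_dummies` produces for a column with more than two categories. The `Branch` column and its values are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Hypothetical column with three categories, for illustration only.
toy_df = pd.DataFrame({"Branch": ["CS", "EE", "ME", "CS"]})

# get_dummies creates one 0/1 indicator column per category.
dummies = pd.get_dummies(toy_df, columns=["Branch"], dtype=int)
print(dummies)
```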

## Visualization with `sns.pairplot`

In [None]:
# Use sns.pairplot to visualize.

## Feature Selection 

Choose the columns corresponding to the features _IQ_ and _internship experience_ to be your `X`. Target _placement_ as your `y`.
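
As a sketch of the selection pattern, the example below picks two feature columns and one target column from a made-up `toy_df`. The column names `IQ`, `Internship_Experience`, and `Placement` are assumptions here; check the real names in your own `df` before using them.

```python
import pandas as pd

# Made-up data; the column names are illustrative assumptions.
toy_df = pd.DataFrame({
    "IQ": [104, 97, 112, 88],
    "Internship_Experience": [1, 0, 1, 0],
    "CGPA": [8.1, 6.9, 9.0, 6.2],
    "Placement": [1, 0, 1, 0],
})

# A list of column names yields a DataFrame (two-dimensional features).
feature_columns = ["IQ", "Internship_Experience"]
X = toy_df[feature_columns]

# A single column name yields a Series (one-dimensional target).
y = toy_df["Placement"]

print(X.shape, y.shape)
```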

In [None]:
# Set X to the desired features.


# Set y to be our target variable.

## Split to Testing and Training Datasets 
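
A minimal sketch of the split on made-up data: `train_test_split` returns the training and testing pieces of `X` and `y` in a fixed order. The `test_size=0.25` value is an assumption for this example, and `random_state` is what makes the shuffle reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

RANDOM_STATE = 45

# Made-up data: 20 rows, two features, a binary target.
X = pd.DataFrame({
    "IQ": np.arange(90, 110),
    "Internship_Experience": [0, 1] * 10,
})
y = pd.Series([0, 1] * 10, name="Placement")

# The return order is X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=RANDOM_STATE
)

print("X_train: %d rows, %d columns" % X_train.shape)
print("X_test: %d rows, %d columns" % X_test.shape)
```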

In [None]:
# Split our data into testing and training pairs.


# Print the length and width of our training and testing data.
print("X_train: %d rows, %d columns" % X_train.shape)
print("X_test: %d rows, %d columns" % X_test.shape)
print("y_train: %d rows, 1 column" % y_train.shape)
print("y_test: %d rows, 1 column" % y_test.shape)

## Build and train your model

Initialize an empty Logistic Regression model, and then fit your model to your training data. 
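
The two steps can be sketched as follows, using synthetic data (generated here, not taken from the dataset) and scikit-learn's default `LogisticRegression` settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training data: the label depends noisily on the first feature.
rng = np.random.default_rng(45)
X_train = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_train = (X_train[:, 0] + noise > 0).astype(int)

# Step 1: create the (untrained) model with default settings.
model = LogisticRegression()

# Step 2: fit the model to the training data.
model.fit(X_train, y_train)

# The learned weights: one coefficient per feature, plus an intercept.
print(model.coef_, model.intercept_)
```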

In [None]:
# Initialize our logistic regression model.

## Evaluation

Make predictions with your test data and save the predictions as `y_pred`.
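
A small sketch of the prediction step on made-up one-feature data: `predict` returns a class label per row, while `predict_proba` returns the estimated probability of each class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: small values are class 0, large values are class 1.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X_train, y_train)

# Made-up test data: one clearly low value and one clearly high value.
X_test = np.array([[0.5], [4.5]])

# Predicted class labels, one per test row.
y_pred = model.predict(X_test)
print(y_pred)

# Estimated probabilities; columns follow the order of model.classes_.
y_proba = model.predict_proba(X_test)
print(y_proba)
```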

In [None]:
# 1. Make predictions on your test data and save them as `y_pred`.


y_pred

Calculate and print the accuracy, precision, recall, and F1 scores of your model.
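
As an illustration of the four metric functions, the sketch below scores a pair of made-up label lists (not model output from this notebook). Each function takes the true labels first and the predicted labels second.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 3 true positives, 3 true negatives,
# 1 false positive, and 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # correct / total
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two above

print("Accuracy Score: %f" % accuracy)
print("Precision Score: %f" % precision)
print("Recall Score: %f" % recall)
print("F1 Score: %f" % f1)
```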

In [None]:
# 2. Calculate and print the accuracy, precision, recall, and F1 scores of your model.

print("Accuracy Score: %f" % )
print("Precision Score: %f" % )
print("Recall Score: %f" % )
print('F1 Score: %f' % )

Plot a confusion matrix of your predicted results.
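
One possible way to draw the plot is an annotated `sns.heatmap` of the matrix returned by `confusion_matrix`, sketched here on made-up labels. The color map and axis labels are stylistic assumptions, and the `matplotlib.use("Agg")` line is only needed when running outside the notebook.

```python
import matplotlib

matplotlib.use("Agg")  # Non-interactive backend so this runs outside a notebook.

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Made-up labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true labels and columns are predicted labels.
cm = confusion_matrix(y_true, y_pred)

# Draw the matrix as a heatmap with the count written in each cell.
fig, ax = plt.subplots()
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax)
ax.set_xlabel("Predicted label")
ax.set_ylabel("True label")
```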

In [None]:
# 3. Plot a confusion matrix of your predicted results.

How many true positives and true negatives did your model get?
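
For a binary problem with labels 0 and 1, flattening the confusion matrix with `.ravel()` yields the four counts in the order true negatives, false positives, false negatives, true positives. A sketch on made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# The 2x2 matrix is [[TN, FP], [FN, TP]]; ravel() flattens it row by row.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("True Negatives: %d" % tn)
print("True Positives: %d" % tp)
```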

In [None]:
# How many true positives and true negatives did your model get?

true_negatives, false_positives, false_negatives, true_positives = ?
print('True Negatives: %d' % true_negatives)
print('True Positives: %d' % true_positives)

Such awful 😞

# What is the Most Important Feature?
 
Use `statsmodels` to create a summary report. Interpret the results.

In [None]:
# Add a constant term to the independent variables.


# Fit the model.


# Print the summary and interpret the results.

# Extra Credit: Use your brain and make a better model (as in better scores).



In [None]:
# Define the new X variable, and reuse the same y variable from before.


# Split our data into testing and training. Remember to use the same random state as you used before.


# Initialize our model.


# Fit-train our model using our training data.


# Make new predictions using our testing data.


# Print each of our scores to inspect performance.


# Plot the confusion matrix.