<a href="https://colab.research.google.com/github/werowe/HypatiaAcademy/blob/master/ml/artom-29_04_2025.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Calculating Probability, Odds, and Log Odds

Given:

Probability of winning (win) = 0.25

Calculations:

Probability of losing (loss) = 1 − win = 0.75

Odds = win / loss = 0.25 / 0.75 = 0.333...

Log odds (logit) = ln(odds) = ln(0.333...) ≈ -1.0986

These calculations show how probability, odds, and log odds are related in logistic regression.
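
The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the variable names (`win`, `loss`, `odds`, `log_odds`) are chosen here for illustration.

```python
import math

win = 0.25                 # probability of winning (given above)
loss = 1 - win             # probability of losing
odds = win / loss          # odds in favor of winning
log_odds = math.log(odds)  # natural log of the odds (the logit)

print(f"loss = {loss:.2f}, odds = {odds:.4f}, log odds = {log_odds:.4f}")
```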

2. Understanding Gambler's Odds

Gambler’s odds (g_odds) = 1.98 (decimal odds, common in betting)

To convert decimal odds to an implied probability (s_odds), take the reciprocal: s_odds = 1 / g_odds = 1 / 1.98 ≈ 0.505

This means that decimal odds of 1.98 correspond to about a 50.5% implied probability of winning.
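
As a rough sketch, the same conversion in Python, followed by the step that links it back to section 1: turning the implied probability into odds and log odds. The names `odds` and `log_odds` are introduced here for illustration and are not part of the original notebook.

```python
import math

g_odds = 1.98                  # decimal (gambler's) odds from the text
s_odds = 1 / g_odds            # implied probability of winning

odds = s_odds / (1 - s_odds)   # odds in favor, as defined in section 1
log_odds = math.log(odds)      # corresponding log odds

print(f"implied probability = {s_odds:.3f}")
print(f"odds = {odds:.4f}, log odds = {log_odds:.4f}")
```

An implied probability just above 0.5 gives odds just above 1 and log odds just above 0.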

3. Logistic Regression and the Sigmoid Function

Logistic regression predicts the probability of an event (such as passing an exam) using the sigmoid function, which maps any real number to a value between 0 and 1 that can be read as a probability.

The general formula for logistic regression is:

logit(p) = ln(p / (1 - p)) = b₀ + b₁x
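
The sigmoid is the inverse of the logit: solving the formula above for p gives p = 1 / (1 + e^-(b₀ + b₁x)). The sketch below illustrates that relationship. The coefficient values `b0` and `b1` are hypothetical, chosen only to produce readable numbers.

```python
import math

def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit: maps any real number z to a value in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients, chosen only for illustration
b0, b1 = -1.0, 0.5

for x in [0, 2, 4]:
    z = b0 + b1 * x   # linear predictor = log odds
    p = sigmoid(z)    # predicted probability
    print(f"x = {x}: log odds = {z:+.2f}, p = {p:.3f}")
```

Applying `sigmoid` to `logit(0.25)` returns 0.25 again, which is the round trip between sections 1 and 3.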



In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Load the data
#df = pd.read_csv("/content/StudentsPerformance.csv")

df = pd.read_csv("https://raw.githubusercontent.com/werowe/HypatiaAcademy/refs/heads/master/ml/StudentsPerformance.csv")

# Create a binary outcome
df['pass_math'] = (df['math score'] >= 60).astype(int)

# Convert categorical variables to dummy variables
df_dummies = pd.get_dummies(df[['gender', 'test preparation course']], drop_first=True)

# Prepare features and target
X = df_dummies
y = df['pass_math']

# Fit logistic regression
model = LogisticRegression()
model.fit(X, y)

# Print each coefficient (in log-odds units) next to its feature name
for feature, coef in zip(X.columns, model.coef_[0]):
    print(f"{feature}: {coef:.3f}")

gender_male: 0.545
test preparation course_none: -0.587
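
These coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below copies the two values from the output above rather than refitting the model.

```python
import math

# Coefficients copied from the output above (log-odds units)
coefs = {
    "gender_male": 0.545,
    "test preparation course_none": -0.587,
}

for feature, coef in coefs.items():
    odds_ratio = math.exp(coef)  # multiplicative change in the odds of passing
    print(f"{feature}: odds ratio = {odds_ratio:.3f}")
```

Under this fitted model, the odds of passing for male students are roughly 1.7 times the odds for female students, and skipping the test preparation course multiplies the odds of passing by roughly 0.56.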


In [59]:
# You can get the model's predictions like this

model.predict(df_dummies)


array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
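
For a binary model, `model.predict` returns class 1 whenever the log odds are positive, that is, whenever the predicted probability is above 0.5. With only two binary features there are just four possible inputs, so a long run of identical predictions can occur when all four combinations land above the threshold. The sketch below uses the two coefficients printed earlier together with a hypothetical intercept of 0.8 (the notebook does not print `model.intercept_`), so the probabilities are illustrative only.

```python
import math
from itertools import product

b_male = 0.545      # coefficient for gender_male (from the output above)
b_no_prep = -0.587  # coefficient for test preparation course_none (from the output above)
intercept = 0.8     # HYPOTHETICAL value; the real one is in model.intercept_

predictions = []
for male, no_prep in product([0, 1], repeat=2):
    z = intercept + b_male * male + b_no_prep * no_prep  # log odds
    p = 1 / (1 + math.exp(-z))                           # probability of passing
    label = int(z > 0)                                   # class 1 when p > 0.5
    predictions.append(label)
    print(f"male={male}, no_prep={no_prep}: p = {p:.3f}, predicted class = {label}")
```

To inspect the real probabilities instead of the hard labels, `model.predict_proba(df_dummies)` returns one column per class.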

In [13]:
df.columns

Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score', 'pass_math'],
      dtype='object')

In [14]:
# Use this to convert all the text columns to numbers

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

df["gender"]=le.fit_transform(df["gender"])
df["race/ethnicity"]=le.fit_transform(df["race/ethnicity"])
df["parental level of education"]=le.fit_transform(df["parental level of education"])
df["lunch"]=le.fit_transform(df["lunch"])
df["test preparation course"]=le.fit_transform(df["test preparation course"])

In [19]:
X = df.loc[:, ['gender', 'race/ethnicity', 'parental level of education',
               'lunch', 'test preparation course']].values

In [20]:
# .values drops the index and returns the underlying values as a NumPy array

y=df["math score"].values

In [22]:
# The default settings gave an error because the data was too big for them,
# so we use solver='sag' and max_iter=1000

model2 = LogisticRegression(solver='sag', max_iter=1000)
model2.fit(X, y)
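
Since `y` now holds raw math scores rather than a 0/1 label, `LogisticRegression` treats every distinct score as its own class, and it can only ever predict a score that appeared in the training data. The sketch below shows this on made-up data; `X_toy`, `y_toy`, and `clf` are illustrative names, and it uses the default solver with a larger `max_iter` rather than the notebook's settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Made-up data: 200 rows, 5 integer-coded features, integer "scores" as target
X_toy = rng.integers(0, 5, size=(200, 5))
y_toy = rng.integers(50, 60, size=200)   # possible scores: 50..59

clf = LogisticRegression(max_iter=2000)  # default solver; settings are illustrative
clf.fit(X_toy, y_toy)

# One class per distinct score seen in training
print(len(clf.classes_), "classes:", clf.classes_)

# Every prediction is one of the training labels
print(set(clf.predict(X_toy)) <= set(y_toy))
```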

In [67]:
# Run a prediction and save it in the NumPy array preds. To show the predictions
# next to the data in a DataFrame, so they are easy to read, first convert
# preds to a pandas Series.

preds=model2.predict(X)

preds_series=pd.Series(preds)
preds_series.name="math score prediction"

In [66]:
# Stack X, y, and preds side by side so they are easy to compare
# (columns 0-4 are the features, column 5 is y, column 6 is preds)

import numpy as np

n=np.column_stack([X,y,preds])


numbers = pd.DataFrame(n)

numbers

       0    1    2    3    4    5    6
0      0    1    1    1    1   72   65
1      0    2    4    1    0   69   65
2      0    1    3    1    1   90   65
3      1    0    0    0    1   47   53
4      1    2    4    1    1   76   62
...  ...  ...  ...  ...  ...  ...  ...
995    0    4    3    1    0   88   74
996    1    2    2    0    1   62   61
997    0    2    2    0    0   59   65
998    0    3    4    1    0   68   65
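
The stacked frame has bare integer column labels, which makes it hard to tell features from targets. One possible fix is to pass `columns=` when building the DataFrame. The sketch below uses the first three rows of the output above; `X_demo`, `y_demo`, `preds_demo`, and `labeled` are names invented for the example.

```python
import numpy as np
import pandas as pd

# First three rows copied from the output above
X_demo = np.array([[0, 1, 1, 1, 1],
                   [0, 2, 4, 1, 0],
                   [0, 1, 3, 1, 1]])
y_demo = np.array([72, 69, 90])
preds_demo = np.array([65, 65, 65])

feature_names = ["gender", "race/ethnicity", "parental level of education",
                 "lunch", "test preparation course"]

# Same column_stack as in the notebook, but with readable column names
labeled = pd.DataFrame(
    np.column_stack([X_demo, y_demo, preds_demo]),
    columns=feature_names + ["math score", "math score prediction"],
)
print(labeled)
```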


In [68]:
pd.concat([df,preds_series],axis=1)

     gender  race/ethnicity  parental level of education  lunch  test preparation course  math score  reading score  writing score  pass_math  math score prediction
0         0               1                            1      1                        1          72             72             74          1                     65
1         0               2                            4      1                        0          69             90             88          1                     65
2         0               1                            3      1                        1          90             95             93          1                     65
3         1               0                            0      0                        1          47             57             44          0                     53
4         1               2                            4      1                        1          76             78             75          1                     62
...     ...             ...                          ...    ...                      ...         ...            ...            ...        ...                    ...
995       0               4                            3      1                        0          88             99             95          1                     74
996       1               2                            2      0                        1          62             55             55          1                     61
997       0               2                            2      0                        0          59             71             65          0                     65
998       0               3                            4      1                        0          68             78             77          1                     65
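
A quick way to judge how close the predicted scores are is to count exact matches and compute the mean absolute error. The sketch below does this for only the nine rows visible in the output above, so the numbers describe that sample and not the full dataset.

```python
# (actual math score, predicted math score) for the nine rows shown above
rows = [(72, 65), (69, 65), (90, 65), (47, 53), (76, 62),
        (88, 74), (62, 61), (59, 65), (68, 65)]

exact_hits = sum(actual == pred for actual, pred in rows)
mean_abs_error = sum(abs(actual - pred) for actual, pred in rows) / len(rows)

print(f"exact matches: {exact_hits} of {len(rows)}")
print(f"mean absolute error: {mean_abs_error:.2f}")
```

On these nine rows there are no exact matches and the predictions are off by about 9 points on average.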
