# Data Challenge 13: Interpreting Logistic Regression

**Purpose**  
Apply what you learned about logistic regression interpretation by analyzing NYC Restaurant Inspection data. 
 
You'll practice interpreting **continuous**, **binary**, and **categorical** predictors, compute **odds ratios**, and assess model accuracy.

**Learning Goals**
- Convert coefficients to odds ratios using `np.exp()`.  
- Interpret ORs for continuous, binary, and categorical predictors.  
- Use accuracy to assess logistic regression performance.  
- Communicate results clearly and responsibly.  
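
As a warm-up for the first two goals, the sketch below shows how exponentiating a log-odds coefficient yields an odds ratio. The coefficient values are hypothetical and chosen only for illustration; they do not come from the inspection data.

```python
import numpy as np

# Hypothetical log-odds coefficients (illustration only, not fitted values).
log_odds = np.array([-0.5, 0.0, 0.7])

# Exponentiating a log-odds coefficient gives the odds ratio (OR).
odds_ratios = np.exp(log_odds)

for coef, odds_ratio in zip(log_odds, odds_ratios):
    if odds_ratio > 1:
        effect = "raises the odds"
    elif odds_ratio < 1:
        effect = "lowers the odds"
    else:
        effect = "leaves the odds unchanged"
    print(f"coef={coef:+.2f} -> OR={odds_ratio:.3f} ({effect})")
```

A negative coefficient always maps to an OR below 1, a positive coefficient to an OR above 1, and a coefficient of zero to an OR of exactly 1.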

**Data:** June 1, 2025 - Nov 4, 2025 Restaurant Health Inspection

[Restaurant Health Inspection](https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j/about_data)


## Instructor Guidance

**Hint: Use the Lecture Deck, Canvas Reading, and Docs to help you with the code**

Use this guide live; students implement below.

**Docs (Quick Links)**
- LogisticRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html  
- accuracy_score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html  
- OneHotEncoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html  
- StandardScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html  
- np.exp: https://numpy.org/doc/stable/reference/generated/numpy.exp.html  

**Pseudocode Plan**

1️⃣ Load cleaned restaurant inspection data from the previous challenge.  
2️⃣ Define target = `is_A` (1 = Grade A, 0 = otherwise).  
3️⃣ Predictors →  
    • Continuous = `SCORE`  
    • Binary = `CRITICAL_NUM`  
    • Categorical = `BORO`  
4️⃣ Scale continuous variables; encode categorical ones.  
5️⃣ Fit `LogisticRegression`.  
6️⃣ Exponentiate coefficients (`np.exp()`) → odds ratios.  
7️⃣ Interpret one continuous, one binary, and one categorical coefficient.  
8️⃣ Evaluate accuracy.  
9️⃣ Reflect on scaling choices and communication of odds.  


## You Do: Student Section
Work in pairs. Comment your choices briefly. Keep code simple: only coerce the columns you use.

## Step 1: Imports and Plot Defaults

In [1]:
import pandas as pd, numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from pathlib import Path
pd.set_option('display.float_format', lambda x: f'{x:,.4f}')

## Step 2: Load CSV, Create Columns, Preview

- Point to your New York City Restaurant Inspection Data 
- Create the `is_A` and `CRITICAL_NUM` columns as you did in the L11 notebook

In [2]:
path = '/Users/Marcy_Student/Desktop/Marcy-Modules/Mod6/data/DOHMH_New_York_City_Restaurant_Inspection_Results_20251110.csv'
df = pd.read_csv(path)
df

Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,...,INSPECTION TYPE,Latitude,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA,Location
0,50167878,GOLDEN STEAMER I INC.,Manhattan,143,MOTT STREET,10013.0000,6465231688,,01/01/1900,,...,,40.7187,-73.9966,102.0000,1.0000,4100.0000,1079581.0000,1002370019.0000,MN24,POINT (-73.996645049413 40.718681310365)
1,50168599,THAI FLAVOR 88 INC.,Manhattan,174,2 AVENUE,10003.0000,2122542868,,01/01/1900,,...,,40.7305,-73.9863,103.0000,2.0000,4000.0000,1077704.0000,1004530001.0000,MN22,POINT (-73.986296382711 40.730463823842)
2,50162584,COZY TEA LOFT,0,141,STATE ROUTE 27,8820.0000,3472619435,,01/01/1900,,...,,,,,,,,,,
3,50174672,EL PALENQUE MEXICAN RESTAURANT CORPORATION,Brooklyn,181,WEST END AVENUE,11235.0000,7182553580,,01/01/1900,,...,,40.5773,-73.9530,315.0000,48.0000,62000.0000,3245985.0000,3087320012.0000,BK17,POINT (-73.952961276652 40.577340234075)
4,50155679,ZADDY'S JERK CHICKEN,Brooklyn,686,HEGEMAN AVENUE,11207.0000,7187752616,,01/01/1900,,...,,40.6621,-73.8866,305.0000,42.0000,110400.0000,3097445.0000,3043290001.0000,BK82,POINT (-73.886623536611 40.662080196538)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
291700,41658103,CHOP-SHOP,Manhattan,254,10 AVENUE,10001.0000,2128200333,Asian/Asian Fusion,01/25/2023,Violations were cited in the following area(s).,...,Cycle Inspection / Re-inspection,40.7488,-74.0034,104.0000,3.0000,9300.0000,1012823.0000,1007220076.0000,MN13,POINT (-74.003374516952 40.748763929496)
291701,50096822,SOHO THAI,Manhattan,141,GRAND STREET,10013.0000,2129668916,Thai,02/20/2024,Violations were cited in the following area(s).,...,Cycle Inspection / Re-inspection,40.7202,-73.9995,102.0000,1.0000,4500.0000,1003045.0000,1002330012.0000,MN24,POINT (-73.999502156932 40.720240386394)
291702,50016367,EL MANATIAL,Queens,104-21,ROOSEVELT AVENUE,11368.0000,7185050250,Latin American,08/28/2025,Violations were cited in the following area(s).,...,Cycle Inspection / Initial Inspection,40.7501,-73.8607,403.0000,21.0000,40300.0000,4307628.0000,4017760061.0000,QN26,POINT (-73.860725144918 40.750139665657)
291703,41236413,DUNKIN,Brooklyn,1575,FLATBUSH AVENUE,11210.0000,3474057014,Donuts,03/18/2025,Violations were cited in the following area(s).,...,Cycle Inspection / Initial Inspection,40.6324,-73.9472,314.0000,45.0000,78600.0000,3205908.0000,3075580031.0000,BK42,POINT (-73.947175876843 40.632384405769)


## Step 3: Define Predictors & Target

- Target is `is_A` 
- X predictors are: SCORE, CRITICAL_NUM (created in Step 2), BORO


In [3]:
# cleaning our data
df_cleaned = df.dropna(subset=['SCORE', 'GRADE', 'CRITICAL FLAG', 'BORO'])
df_cleaned['GRADE']

12        A
19        A
22        A
33        P
71        C
         ..
291697    A
291698    A
291700    B
291701    A
291703    A
Name: GRADE, Length: 142309, dtype: object

In [4]:
df_cleaned['CRITICAL FLAG'].value_counts()

CRITICAL FLAG
Critical          71637
Not Critical      70063
Not Applicable      609
Name: count, dtype: int64

In [5]:
# work on an explicit copy so the assignment below does not raise SettingWithCopyWarning
df_cleaned = df_cleaned.copy()
# normalize the flag text: cast to string, trim whitespace, lowercase
df_cleaned['CRITICAL FLAG'] = df_cleaned['CRITICAL FLAG'].astype(str).str.strip().str.lower()


In [6]:
# map the flag to numbers: critical -> 1, not applicable -> 2 (dropped in the next cell), anything else -> 0
crit_map = {"critical": 1, "not applicable": 2}

# .assign returns a new DataFrame, which avoids SettingWithCopyWarning
df_cleaned = df_cleaned.assign(
    CRITICAL_NUM=df_cleaned['CRITICAL FLAG'].map(crit_map).fillna(0).astype(int)
)


In [None]:
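# drop the "not applicable" rows (coded 2); .copy() keeps later column assignments warning-free
df_cleaned = df_cleaned[df_cleaned['CRITICAL_NUM'] < 2].copy()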
df_cleaned['CRITICAL_NUM'].value_counts()

CRITICAL_NUM
1    71637
0    70063
Name: count, dtype: int64

In [None]:
# y = 'is_A'
df_cleaned['is_A'] = (df_cleaned['GRADE']=='A').astype(int)
y = df_cleaned['is_A']

# One predictor per model, each kept as a one-column DataFrame.
# scikit-learn's LogisticRegression fits its own intercept, so no constant column is added.
X1 = df_cleaned[['SCORE']]          # continuous
X2 = df_cleaned[['CRITICAL_NUM']]   # binary
X3 = df_cleaned[['BORO']]           # categorical

## Step 4: Split Data (70/30, Stratified by Target)

In [None]:
x1_train, x1_test, y1_train, y1_test = train_test_split(X1, y, train_size=0.7, test_size=0.3, stratify=y, random_state=42)
x2_train, x2_test, y2_train, y2_test = train_test_split(X2, y, train_size=0.7, test_size=0.3, stratify=y, random_state=42)
x3_train, x3_test, y3_train, y3_test = train_test_split(X3, y, train_size=0.7, test_size=0.3, stratify=y, random_state=42)
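
Because all three calls share the same `y`, `stratify=y`, and `random_state=42`, they should partition the rows identically. The sketch below checks what stratification buys you: the share of Grade A rows stays nearly the same in train and test. It runs on synthetic data; `X_demo`, `y_demo`, and the 68% positive rate are hypothetical stand-ins, not values taken from the inspection file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (hypothetical values).
y_demo = pd.Series((rng.random(1000) < 0.68).astype(int), name="is_A")
X_demo = pd.DataFrame({"SCORE": rng.integers(0, 40, size=1000)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, test_size=0.3, stratify=y_demo, random_state=42
)

# With stratify=y, the class balance should match closely across the two splits.
print(f"Share of Grade A in train: {y_tr.mean():.3f}")
print(f"Share of Grade A in test:  {y_te.mean():.3f}")
```

In the notebook, the same check would be `y1_train.mean()` against `y1_test.mean()`.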

## Step 5: Preprocessing (You can choose to do this in a Pipeline)  

- Scale continuous features  
- Pass binary as is  
- One-hot encode categorical feature (`BORO`)  

In [None]:
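# scaling continuous variable: fit the scaler on the training set only
scaler = StandardScaler()
# x1 train
x1_train_scaled = scaler.fit_transform(x1_train)
# x1 test: transform only, so the test set reuses the training mean and std
x1_test_scaled = scaler.transform(x1_test)


# one-hot encode categorical variable
x3_train_encoded = pd.get_dummies(x3_train, columns=['BORO'], prefix='Borough')
x3_test_encoded = pd.get_dummies(x3_test, columns=['BORO'], prefix='Borough')
# align test columns to train in case a borough is missing from one split
x3_test_encoded = x3_test_encoded.reindex(columns=x3_train_encoded.columns, fill_value=0)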

## Step 6: Fit Model & Evaluate Accuracy

- Fit `is_A ~ score` using **LogisticRegression**  
- Compute predictions with `.predict()`  
- Evaluate accuracy with `accuracy_score()`

In [11]:
# one estimator per predictor: .fit() returns the same object it was called on,
# so reusing a single LogisticRegression would leave model1, model2, and model3
# all pointing at the last fit

# Variables 1: Continuous
model1 = LogisticRegression().fit(x1_train_scaled, y1_train)
y1_pred = model1.predict(x1_test_scaled)  # .predict() already returns 0/1 labels
print(f"Model 1 Accuracy: {accuracy_score(y1_test, y1_pred)}")

# Variables 2: Binary
model2 = LogisticRegression().fit(x2_train, y2_train)
y2_pred = model2.predict(x2_test)
print(f"Model 2 Accuracy: {accuracy_score(y2_test, y2_pred)}")

# Variables 3: Categorical
model3 = LogisticRegression().fit(x3_train_encoded, y3_train)
y3_pred = model3.predict(x3_test_encoded)
print(f"Model 3 Accuracy: {accuracy_score(y3_test, y3_pred)}")

Model 1 Accuracy: 0.9716772524111974
Model 2 Accuracy: 0.6774876499647142
Model 3 Accuracy: 0.6774876499647142
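
Models 2 and 3 report the same accuracy to every printed digit. One possible explanation, not verified here, is that both predict the majority class for every row, in which case the accuracy would simply equal the share of that class in the test set. The sketch below shows how such a baseline can be computed with `DummyClassifier`; the labels are hypothetical (68 positives out of 100), not the real test set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 68 Grade A rows and 32 non-A rows.
y_true = np.array([1] * 68 + [0] * 32)
# DummyClassifier ignores the features, so a placeholder column is enough.
X_placeholder = np.zeros((len(y_true), 1))

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_true)
baseline_acc = accuracy_score(y_true, baseline.predict(X_placeholder))

print(f"Majority-class baseline accuracy: {baseline_acc:.4f}")
```

In the notebook, comparing each model's accuracy with `max(y2_test.mean(), 1 - y2_test.mean())` would show whether a predictor adds anything beyond this baseline.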


## Step 7: Extract Coefficients and Convert to Odds Ratios


In [None]:
# coefficients + intercept for Model 1 (continuous: SCORE, standardized)
coef1 = model1.coef_[0]
intercept1 = model1.intercept_[0]

print(f"Intercept (Log-Odds): {intercept1:.4f}")
print(f"Coefficients (Log-Odds): {np.round(coef1, 4)}")
# exponentiate log-odds to get odds ratios
print(f"Odds Ratios: {np.round(np.exp(coef1), 4)}")


In [None]:
# coefficients + intercept for Model 2 (binary: CRITICAL_NUM)
coef2 = model2.coef_[0]
intercept2 = model2.intercept_[0]

print(f"Intercept (Log-Odds): {intercept2:.4f}")
print(f"Coefficients (Log-Odds): {np.round(coef2, 4)}")
# exponentiate log-odds to get odds ratios
print(f"Odds Ratios: {np.round(np.exp(coef2), 4)}")


In [None]:
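# coefficients + intercept for Model 3 (categorical: BORO dummies)
# label each coefficient with its column name so every borough can be read off
coef3 = pd.Series(model3.coef_[0], index=x3_train_encoded.columns)
intercept3 = model3.intercept_[0]

print(f"Intercept (Log-Odds): {intercept3:.4f}")
print("Coefficients (Log-Odds):")
print(coef3)
# exponentiate log-odds to get odds ratios
print("Odds Ratios:")
print(np.exp(coef3))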


## Step 8: Interpret Each Predictor

**Remember**
💡 OR > 1 → increases odds of Grade A  
💡 OR < 1 → decreases odds of Grade A

**Type markdown interpreting all 3 predictors in plain English**
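
When wording these interpretations, keep in mind that an odds ratio describes a change in odds, not in probability, and that a coefficient fitted on a standardized predictor is per one standard deviation rather than per point. The sketch below turns an odds ratio into a plain-English phrase and rescales a per-SD coefficient to a per-point odds ratio. `describe_odds_ratio` is a hypothetical helper, and `coef_per_sd` and `score_sd` are made-up numbers. In the notebook, the standard deviation would come from `scaler.scale_`.

```python
import numpy as np

def describe_odds_ratio(odds_ratio: float, outcome: str = "Grade A") -> str:
    """Turn an odds ratio into a plain-English phrase about the odds."""
    if np.isclose(odds_ratio, 1.0):
        return f"no change in the odds of {outcome}"
    pct_change = abs(odds_ratio - 1) * 100
    direction = "higher" if odds_ratio > 1 else "lower"
    return f"{pct_change:.1f}% {direction} odds of {outcome}"

# Hypothetical values for illustration only.
coef_per_sd = -2.0   # log-odds change per 1 SD increase in SCORE
score_sd = 8.0       # standard deviation of SCORE in the training set

or_per_sd = np.exp(coef_per_sd)                # odds ratio per 1 SD
or_per_point = np.exp(coef_per_sd / score_sd)  # odds ratio per 1 inspection point

print("Per SD:   ", describe_odds_ratio(or_per_sd))
print("Per point:", describe_odds_ratio(or_per_point))
```

For a binary predictor, the odds ratio compares the group coded 1 with the group coded 0. For a one-hot encoded predictor, how each odds ratio reads depends on whether a reference category was dropped.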


# We Share: Reflection & Wrap-Up

Write **one short paragraph** (4–6 sentences). Be specific and use evidence from your notebook.

**Which predictor had the strongest relationship with getting an A grade?**  
Use the odds ratios and accuracy to support your answer.  