# Logistic Regression Model for Q2

This is a sample logistic regression model built on the Q2 dataset, using the features that correlated most heavily with acceptance rates.

TL;DR - The model is only slightly better than predicting no acceptance every time (about 78% accuracy vs. a 73% baseline). It probably needs some tweaking.

This is just a sample of how I'd roughly do a simple model in Python.

The top factors for a job's acceptance from the [Tableau report](https://public.tableau.com/app/profile/william8331/viz/shared/PRQT7Y97N) were:
1. Jobs posted after 12pm
2. Category of the job
3. Medium vs small jobs
4. City of the job
5. Number of tradies the job was sent to

## Data Prep
Cleanse data and build features

In [88]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [89]:
# Method for categorizing the city based on our rules-based approach as noted in the tableau report.
def categorize_city(latitude):
    if latitude > -18.5 and latitude < -17.5:
        return 'Broome'
    elif latitude > -35 and latitude < -33:
        return 'Sydney'
    else:
        return 'Melbourne'
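
As a quick sanity check of the rule above, the sketch below repeats the function and calls it with approximate latitudes for the three cities. The sample latitudes are my own illustrative values, not rows from the dataset. It also shows that anything outside the Broome and Sydney bands, including a missing value, falls through to 'Melbourne'.

```python
def categorize_city(latitude):
    # Same rule as above: two latitude bands, everything else is Melbourne
    if -18.5 < latitude < -17.5:
        return 'Broome'
    elif -35 < latitude < -33:
        return 'Sydney'
    else:
        return 'Melbourne'

# Approximate latitudes, illustrative only (not taken from data_q2.csv)
samples = {
    'broome_like': -17.96,
    'sydney_like': -33.87,
    'melbourne_like': -37.81,
    'out_of_range': 0.0,            # nowhere near any of the three cities
    'missing': float('nan'),        # comparisons with NaN are always False
}

labels = {name: categorize_city(lat) for name, lat in samples.items()}
print(labels)
```

If the data could contain coordinates outside these three cities, an explicit Melbourne band plus an 'Unknown' fallback might be safer than the catch-all `else`.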

In [90]:
df=pd.read_csv('data_q2.csv')

In [91]:
df.dtypes

time_of_post              object
latitude                 float64
 longitude               float64
category                   int64
number_of_tradies          int64
estimated_size            object
number_of_impressions    float64
accepted                   int64
dtype: object

In [92]:
# Remove bad data in impressions (negative counts); copy so the column assignments below work on an independent DataFrame
df=df[df['number_of_impressions']>=0].copy()

# Convert time of post to a datetime
df['time_of_post'] = pd.to_datetime(df['time_of_post'])

# Make the Category integer into a categorical string
df['category_text']=df['category'].astype(str)

# Apply the custom function to the 'latitude' column
df['city'] = df['latitude'].apply(categorize_city)

# Create the after 12pm flag
df['after_12pm_flag'] = df['time_of_post'].dt.hour >= 12

In [93]:
# Define the bins for bucketing (intervals of 1000)
bins = range(0, df['number_of_tradies'].max() + 1000, 1000)

# Use pd.cut() to categorize the 'number_of_tradies' into these bins
df['tradie_bucket'] = pd.cut(df['number_of_tradies'], bins=bins, right=False, labels=[f'{i}-{i+999}' for i in bins[:-1]])
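
To make the bucketing behaviour concrete, here is a small sketch on made-up counts (the `bucket` helper and the toy values are hypothetical, not part of the notebook). It mirrors the `pd.cut` call above and shows how values on the bin edges are assigned when `right=False`.

```python
import pandas as pd

def bucket(counts):
    # Mirrors the cell above: left-closed bins of width 1000 with 'start-end' labels
    bins = list(range(0, int(counts.max()) + 1000, 1000))
    labels = [f'{i}-{i + 999}' for i in bins[:-1]]
    return pd.cut(counts, bins=bins, right=False, labels=labels)

# Values on and around the bin edges: 999 stays in the first bin, 1000 starts the second
typical = bucket(pd.Series([0, 999, 1000, 2500]))
print(typical.astype(str).tolist())

# Edge case: a maximum that is an exact multiple of 1000 sits on the open
# right edge of the last bin, so it gets no bucket at all
edge = bucket(pd.Series([500, 3000]))
print(edge.isna().tolist())
```

The edge case does not seem to bite in this dataset, since the buckets shown later stop at 9000-9999, but it is worth checking if the data changes.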

In [94]:
# select only relevant columns
columns=['after_12pm_flag','tradie_bucket','city','category_text','estimated_size','accepted']
df_input=df[columns]

# print out df_input. This is going to be fed into our model
df_input

| | after_12pm_flag | tradie_bucket | city | category_text | estimated_size | accepted |
|---|---|---|---|---|---|---|
| 0 | False | 8000-8999 | Melbourne | 8 | medium | 0 |
| 1 | True | 5000-5999 | Melbourne | 3 | medium | 0 |
| 2 | True | 3000-3999 | Melbourne | 7 | small | 0 |
| 3 | False | 9000-9999 | Sydney | 3 | medium | 1 |
| 4 | True | 2000-2999 | Sydney | 6 | small | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 9993 | True | 1000-1999 | Sydney | 1 | medium | 0 |
| 9995 | True | 8000-8999 | Melbourne | 6 | small | 0 |
| 9996 | True | 9000-9999 | Sydney | 3 | small | 1 |
| 9997 | False | 4000-4999 | Sydney | 7 | medium | 0 |


## Logistic Regression Modelling
A simple, explainable model for estimating the probability of acceptance

In [95]:
# Define features and target variable
X = df_input.drop(columns=['accepted'])  # Drop the target column; everything else is a feature
y = df_input['accepted']

# Convert categorical variables to dummy variables if necessary
X = pd.get_dummies(X, drop_first=True)

X

| | after_12pm_flag | tradie_bucket_1000-1999 | tradie_bucket_2000-2999 | tradie_bucket_3000-3999 | tradie_bucket_4000-4999 | tradie_bucket_5000-5999 | tradie_bucket_6000-6999 | tradie_bucket_7000-7999 | tradie_bucket_8000-8999 | tradie_bucket_9000-9999 | ... | city_Sydney | category_text_2 | category_text_3 | category_text_4 | category_text_5 | category_text_6 | category_text_7 | category_text_8 | category_text_9 | estimated_size_small |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | False | False | True | False | False |
| 1 | True | False | False | False | False | True | False | False | False | False | ... | False | False | True | False | False | False | False | False | False | False |
| 2 | True | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | True | False | False | True |
| 3 | False | False | False | False | False | False | False | False | False | True | ... | True | False | True | False | False | False | False | False | False | False |
| 4 | True | False | True | False | False | False | False | False | False | False | ... | True | False | False | False | False | True | False | False | False | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9993 | True | True | False | False | False | False | False | False | False | False | ... | True | False | False | False | False | False | False | False | False | False |
| 9995 | True | False | False | False | False | False | False | False | True | False | ... | False | False | False | False | False | True | False | False | False | True |
| 9996 | True | False | False | False | False | False | False | False | False | True | ... | True | False | True | False | False | False | False | False | False | True |
| 9997 | False | False | False | False | True | False | False | False | False | False | ... | True | False | False | False | False | False | True | False | False | False |
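
The dummy encoding above drops one level per categorical column, which is why there is no `estimated_size_medium` column. A minimal sketch on a toy frame (the three rows are made up for illustration) shows what `drop_first=True` does: the alphabetically first level of each column becomes the implicit reference category.

```python
import pandas as pd

# Toy rows, illustrative only
toy = pd.DataFrame({
    'city': ['Broome', 'Melbourne', 'Sydney'],
    'estimated_size': ['medium', 'small', 'medium'],
})

# drop_first=True removes the first level of each column ('Broome', 'medium')
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())

# The first row (Broome, medium) is the reference case: all dummies are off
print(encoded.iloc[0].tolist())
```

Each coefficient for a dummy column is therefore read relative to the dropped level, for example `estimated_size_small` relative to medium jobs.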


In [96]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.7756147540983607
Confusion Matrix:
 [[1973  170]
 [ 487  298]]
Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.92      0.86      2143
           1       0.64      0.38      0.48       785

    accuracy                           0.78      2928
   macro avg       0.72      0.65      0.67      2928
weighted avg       0.76      0.78      0.75      2928



- 0 = not accepted, 1 = accepted
- The model's accuracy (78%) is only slightly better than always guessing 0, which would be right 73% of the time (2143 of the 2928 test rows are not accepted).
- Recall is particularly bad for acceptances: the model caught only 38% of the jobs that were actually accepted.
- Probably needs fine-tuning.
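
The baseline and recall figures in the notes above can be reproduced directly from the confusion matrix. The sketch below just redoes that arithmetic with the numbers copied from the printed output; it does not refit anything.

```python
import numpy as np

# Confusion matrix copied from the output above
# rows = actual class (0, 1), columns = predicted class (0, 1)
cm = np.array([[1973, 170],
               [487, 298]])

total = cm.sum()

# Always predicting 0 is correct for every actual-0 row
baseline_accuracy = cm[0].sum() / total

# Correct predictions sit on the diagonal
model_accuracy = np.trace(cm) / total

# Of the jobs that were actually accepted, how many did the model flag?
recall_accepted = cm[1, 1] / cm[1].sum()

# Of the jobs the model flagged as accepted, how many really were?
precision_accepted = cm[1, 1] / cm[:, 1].sum()

print(f"baseline accuracy:  {baseline_accuracy:.3f}")
print(f"model accuracy:     {model_accuracy:.3f}")
print(f"recall (accepted):  {recall_accepted:.3f}")
print(f"precision (accepted): {precision_accepted:.3f}")
```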

In [97]:
# Predict the probability of acceptance for one test example
y_pred_prob = model.predict_proba([X_test[1]])
prob_for_accepted = y_pred_prob[:,1]
print(prob_for_accepted)

[0.0994961]
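
Since `predict_proba` returns a probability rather than a hard label, one possible tweak for the low recall is to lower the decision threshold below the default of 0.5. The sketch below illustrates the idea on synthetic, imbalanced data from `make_classification`; it is a stand-in, not the Q2 dataset, and the names `demo_model`, `X_demo` and the 0.3 threshold are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 27% positives (not the Q2 dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.73, 0.27], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class for each test row
proba = demo_model.predict_proba(X_te)[:, 1]

results = {}
for threshold in (0.5, 0.3):
    # Label a row as positive when its probability clears the threshold
    pred = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, pred, zero_division=0),
        recall_score(y_te, pred),
    )
    print(f"threshold={threshold}: precision={results[threshold][0]:.2f}, "
          f"recall={results[threshold][1]:.2f}")
```

Lowering the threshold can only keep or raise recall, usually at the cost of precision, so the right cut-off depends on which error is more expensive. Passing `class_weight='balanced'` to `LogisticRegression` is another option worth trying.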


## Ranking Feature Importance
Understanding which features best explain variation in the target variable
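
A logistic regression coefficient is a change in log-odds, so exponentiating it gives an odds ratio, which is often easier to read. The sketch below does this for the two largest coefficients from the ranking output further down. One caveat, which is my reading of the pipeline rather than something stated in the report: because the features went through `StandardScaler`, each coefficient is per one standard deviation of the feature, not per flip of the flag.

```python
import math

# Coefficients copied from the ranking output in this section
# (fitted on standardized features, so "per one standard deviation")
coefficients = {
    'after_12pm_flag': 1.066756,
    'estimated_size_small': -0.711479,
}

# exp(coefficient) = multiplicative change in the odds of acceptance
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = 'raises' if ratio > 1 else 'lowers'
    print(f"{name}: odds ratio {ratio:.2f} ({direction} the odds of acceptance)")
```

An odds ratio above 1 means the feature pushes towards acceptance and below 1 means it pushes away, which matches the signs of the coefficients.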

In [98]:
# Access the model's coefficients (feature importance)
feature_importance = model.coef_[0]  # Coefficients for each feature

# Combine feature names with their importance (coefficients)
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': feature_importance
})

# Sort the features by their importance (absolute value of coefficients)
importance_df['Absolute Coefficient'] = importance_df['Coefficient'].abs()
importance_df = importance_df.sort_values(by='Absolute Coefficient', ascending=False)

# Print the feature importance
print(importance_df)

                    Feature  Coefficient  Absolute Coefficient
0           after_12pm_flag     1.066756              1.066756
20     estimated_size_small    -0.711479              0.711479
8   tradie_bucket_8000-8999     0.519104              0.519104
9   tradie_bucket_9000-9999     0.436954              0.436954
5   tradie_bucket_5000-5999     0.409946              0.409946
7   tradie_bucket_7000-7999     0.394929              0.394929
3   tradie_bucket_3000-3999     0.187981              0.187981
15          category_text_5    -0.173029              0.173029
12          category_text_2    -0.156403              0.156403
4   tradie_bucket_4000-4999     0.136110              0.136110
14          category_text_4    -0.131803              0.131803
16          category_text_6    -0.115528              0.115528
10           city_Melbourne    -0.114840              0.114840
19          category_text_9     0.108789              0.108789
2   tradie_bucket_2000-2999     0.093160              0