# Logistic Regression Model for Q2

This is a sample logistic regression model built on the Q2 dataset, using the features that we saw correlate most heavily with acceptance rates.

TL;DR - The model is only slightly better than guessing no-acceptance every time (78% vs 73% accuracy). It probably needs some tweaking.

This is just a sample of how I'd roughly do a simple model in Python.

The key factors for a job's acceptance from the [Tableau report](https://public.tableau.com/app/profile/william8331/viz/shared/PRQT7Y97N) were:
1. Jobs posted after 12pm
2. Category of the job
3. Medium vs Small Jobs
4. City of the job
5. Number of Tradies job was sent to

## Data Prep
Cleanse data and build features

In [99]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [100]:
# Method for categorizing the city based on our rules-based approach as noted in the tableau report.
def categorize_city(latitude):
    if latitude > -18.5 and latitude < -17.5:
        return 'Broome'
    elif latitude > -35 and latitude < -33:
        return 'Sydney'
    else:
        return 'Melbourne'
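
The same rules can also be applied to a whole column at once rather than row by row. The following is a minimal sketch using `np.select`; the function name `categorize_city_vectorized` and the sample latitudes are hypothetical and only for illustration. It keeps the behaviour of `categorize_city`, including the fact that any latitude outside the Broome and Sydney bands is labelled Melbourne.

```python
import numpy as np
import pandas as pd

def categorize_city_vectorized(latitude: pd.Series) -> pd.Series:
    """Vectorized equivalent of categorize_city (same latitude bands)."""
    conditions = [
        (latitude > -18.5) & (latitude < -17.5),  # Broome band
        (latitude > -35) & (latitude < -33),      # Sydney band
    ]
    choices = ['Broome', 'Sydney']
    # Anything outside both bands falls through to 'Melbourne', as in the original
    labels = np.select(conditions, choices, default='Melbourne')
    return pd.Series(labels, index=latitude.index)

# Hypothetical latitudes, for illustration only (not taken from data_q2.csv)
sample = pd.Series([-17.96, -33.87, -37.81])
print(categorize_city_vectorized(sample).tolist())
```

Because Melbourne is the catch-all, a missing or out-of-range latitude would also be labelled Melbourne, which may be worth checking before trusting the `city` feature.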

In [101]:
df=pd.read_csv('data_q2.csv')

In [102]:
df.dtypes

time_of_post              object
latitude                 float64
longitude                float64
category                   int64
number_of_tradies          int64
estimated_size            object
number_of_impressions    float64
accepted                   int64
dtype: object

In [103]:
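# Remove bad data in impressions (negative values; NaN rows are dropped too, since NaN >= 0 is False)
# .copy() avoids a SettingWithCopyWarning when new columns are added below
df = df[df['number_of_impressions'] >= 0].copy()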

# Convert time of post to a datetime
df['time_of_post'] = pd.to_datetime(df['time_of_post'])

# Make the Category integer into a categorical string
df['category_text']=df['category'].astype(str)

# Apply the custom function to the 'latitude' column
df['city'] = df['latitude'].apply(categorize_city)

# Create the after 12pm flag
df['after_12pm_flag'] = df['time_of_post'].dt.hour >= 12

In [104]:
# select only relevant columns
columns=['after_12pm_flag','number_of_tradies','city','category_text','estimated_size','accepted']
df_input=df[columns]

# print out df_input. This is going to be fed into our model
df_input

Unnamed: 0,after_12pm_flag,number_of_tradies,city,category_text,estimated_size,accepted
0,False,8376,Melbourne,8,medium,0
1,True,5089,Melbourne,3,medium,0
2,True,3677,Melbourne,7,small,0
3,False,9732,Sydney,3,medium,1
4,True,2476,Sydney,6,small,0
...,...,...,...,...,...,...
9993,True,1542,Sydney,1,medium,0
9995,True,8260,Melbourne,6,small,0
9996,True,9732,Sydney,3,small,1
9997,False,4242,Sydney,7,medium,0


## Logistic Regression Modelling
A simple, explainable model for estimating the probability of acceptance

In [105]:
# Define features and target variable
X = df_input.drop(columns=['accepted'])  # Drop the target column; the rest are features
y = df_input['accepted']

# Convert categorical variables to dummy variables if necessary
X = pd.get_dummies(X, drop_first=True)

X

Unnamed: 0,after_12pm_flag,number_of_tradies,city_Melbourne,city_Sydney,category_text_2,category_text_3,category_text_4,category_text_5,category_text_6,category_text_7,category_text_8,category_text_9,estimated_size_small
0,False,8376,True,False,False,False,False,False,False,False,True,False,False
1,True,5089,True,False,False,True,False,False,False,False,False,False,False
2,True,3677,True,False,False,False,False,False,False,True,False,False,True
3,False,9732,False,True,False,True,False,False,False,False,False,False,False
4,True,2476,False,True,False,False,False,False,True,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9993,True,1542,False,True,False,False,False,False,False,False,False,False,False
9995,True,8260,True,False,False,False,False,False,True,False,False,False,True
9996,True,9732,False,True,False,True,False,False,False,False,False,False,True
9997,False,4242,False,True,False,False,False,False,False,True,False,False,False
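
As a small illustration of what `drop_first=True` does, the sketch below encodes a toy frame (hypothetical rows, not the Q2 data). The alphabetically first level of each categorical column is dropped and becomes the implicit baseline, which is why there is no `city_Broome` or `estimated_size_medium` column in the output above.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    'city': ['Broome', 'Sydney', 'Melbourne'],
    'estimated_size': ['medium', 'small', 'medium'],
})

# The first level of each column ('Broome', 'medium') is dropped as the baseline
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Each remaining coefficient is then read relative to that baseline level.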


In [106]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.7810792349726776
Confusion Matrix:
 [[1977  166]
 [ 475  310]]
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.92      0.86      2143
           1       0.65      0.39      0.49       785

    accuracy                           0.78      2928
   macro avg       0.73      0.66      0.68      2928
weighted avg       0.76      0.78      0.76      2928



- 0 = not accepted, 1 = accepted
- The model's accuracy (78%) is only slightly better than always guessing 0, which would score 73% on this test set (2143 of 2928).
- Recall is particularly bad for acceptance: the model caught only 39% of the jobs that were actually accepted (310 of 785).
- Probably needs fine-tuning
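
The baseline and recall figures can be checked directly from the confusion matrix. This is a small sketch that only re-does the arithmetic on the numbers printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = actual, columns = predicted)
cm = np.array([[1977, 166],
               [475, 310]])

total = cm.sum()
model_accuracy = np.trace(cm) / total     # correct predictions / all predictions
baseline_accuracy = cm[0].sum() / total   # accuracy of always predicting 0 (not accepted)
recall_accepted = cm[1, 1] / cm[1].sum()  # share of accepted jobs the model caught

print(f"Model accuracy:    {model_accuracy:.3f}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Recall (accepted): {recall_accepted:.3f}")
```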

In [107]:
# Predict the acceptance probability for one example (the second row of the test set)
y_pred_prob = model.predict_proba([X_test[1]])
prob_for_accepted = y_pred_prob[:,1]
print(prob_for_accepted)

[0.0957528]
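
One possible tweak, given the low recall, is to compare these probabilities against a threshold lower than the default 0.5. The sketch below shows the idea on synthetic data from `make_classification`, which is a stand-in and not the Q2 dataset; the names `demo_model`, `X_demo` and the 0.3 threshold are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (NOT the Q2 dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=5, weights=[0.73, 0.27], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

demo_model = LogisticRegression().fit(X_tr, y_tr)
prob_accepted = demo_model.predict_proba(X_te)[:, 1]

# predict() effectively thresholds at 0.5; a lower threshold flags more positives
pred_default = (prob_accepted >= 0.5).astype(int)
pred_lowered = (prob_accepted >= 0.3).astype(int)

print("Recall at 0.5:", recall_score(y_te, pred_default))
print("Recall at 0.3:", recall_score(y_te, pred_lowered))
```

Lowering the threshold can only keep or raise recall, but it usually costs precision, so the right value depends on how expensive a false positive is compared with a missed acceptance.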


## Ranking Feature Importance
Understanding which features best explain variation in the target variable

In [108]:
# Access the model's coefficients (feature importance)
feature_importance = model.coef_[0]  # Coefficients for each feature

# Combine feature names with their importance (coefficients)
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': feature_importance
})

# Sort the features by their importance (absolute value of coefficients)
importance_df['Absolute Coefficient'] = importance_df['Coefficient'].abs()
importance_df = importance_df.sort_values(by='Absolute Coefficient', ascending=False)

# Print the feature importance
print(importance_df)

                 Feature  Coefficient  Absolute Coefficient
0        after_12pm_flag     1.080088              1.080088
1      number_of_tradies     0.736091              0.736091
12  estimated_size_small    -0.716592              0.716592
4        category_text_2    -0.033984              0.033984
11       category_text_9    -0.025938              0.025938
9        category_text_7     0.025786              0.025786
6        category_text_4     0.016909              0.016909
10       category_text_8    -0.013143              0.013143
7        category_text_5     0.010122              0.010122
3            city_Sydney    -0.006853              0.006853
8        category_text_6    -0.004571              0.004571
2         city_Melbourne     0.004548              0.004548
5        category_text_3     0.001321              0.001321


- The after 12pm flag seems to explain a lot of the variation
- The estimated size and the number of tradies also seem to play a part, while the city and category coefficients are all close to zero.
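
To make the coefficients easier to read, they can be converted to odds ratios with `exp(coefficient)`. This is a rough sketch using the top three values copied from the output above. Because every feature went through `StandardScaler`, each ratio describes the change in the odds of acceptance for a one-standard-deviation increase in that feature, not a one-unit increase.

```python
import numpy as np
import pandas as pd

# Top three coefficients copied from the output above
coefs = pd.Series({
    'after_12pm_flag': 1.080088,
    'number_of_tradies': 0.736091,
    'estimated_size_small': -0.716592,
})

# exp(coefficient) = multiplicative change in the odds of acceptance
# per one standard deviation of the (scaled) feature
odds_ratios = np.exp(coefs)
print(odds_ratios.round(2))
```

A ratio above 1 means the feature pushes the odds of acceptance up, and a ratio below 1 means it pushes them down.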