# Chapter 1

- Response variable = dependent variable = y
- Explanatory variables = independent variables = X
- Linear regression = The response variable is numeric
- Logistic regression = The response variable is logical (binary: True/False or 1/0)
- Simple linear/logistic regression = There is only one explanatory variable


```
# Goal: fit a regression line y = m*x + c
import numpy as np
import pandas as pd
import seaborn as sns
from statsmodels.formula.api import ols
# Formula "y ~ x_num" means "y explained by x_num"; the fit estimates slope m and intercept c
num_model = ols("y ~ x_num", data=df).fit()
# "+ 0" drops the intercept, so each coefficient is the mean of y for that category (instead of a difference from the baseline category)
cat_model = ols("y ~ x_cat + 0", data=df).fit()
# See model params: intercept c and slope m for num_model, per-category means for cat_model
print(num_model.params)
print(cat_model.params)
# See prediction on original fitted data 
print(num_model.fittedvalues)
# See residuals
print(num_model.resid)
# Model summary
print(num_model.summary())
# R-squared
print(num_model.rsquared)
# Residual standard error (RSE): square root of the residual mean squared error
rse = np.sqrt(num_model.mse_resid)
# Create test data
test_data = pd.DataFrame({"x_num": np.arange(20, 41)})
# Predict on test data
print(num_model.predict(test_data))
# Per-observation influence measures
summary_df = num_model.get_influence().summary_frame()
# Leverage
df["leverage"] = summary_df["hat_diag"]
# Cook's distance
df["cooks_dist"] = summary_df["cooks_d"]
# Residual plot
sns.residplot(x="x_num", y="y", data=df, lowess=True)
# Q-Q plot
from statsmodels.api import qqplot
qqplot(data=num_model.resid, fit=True, line="45")
# Scale-location plot: sqrt(|standardized residuals|) against fitted values
residuals_standardized = num_model.get_influence().resid_studentized_internal
residuals_sqrt_abs = np.sqrt(np.abs(residuals_standardized))
sns.regplot(x=num_model.fittedvalues, y=residuals_sqrt_abs, ci=None, lowess=True)
# NOTE: You can transform X and/or y before fitting. If y was transformed, back-transform the predictions to get values on the original scale.
```
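To make "finds m and c on its own" concrete, here is a minimal sketch of the closed-form least-squares fit for one explanatory variable, in plain Python. It is an illustration only, not how statsmodels is implemented. The helper `fit_line` and the toy data are hypothetical.

```python
# Closed-form simple least squares (illustrative sketch, not the statsmodels internals).
def fit_line(x, y):
    """Return (slope, intercept) that minimise the sum of squared residuals."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical toy data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

m, c = fit_line(x, y)

# Fitted values and residuals: response = fitted value + residual
fitted = [m * xi + c for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# R-squared: 1 - (residual sum of squares / total sum of squares)
y_mean = sum(y) / len(y)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((yi - y_mean) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(f"slope={m:.3f}, intercept={c:.3f}, R^2={r_squared:.4f}")
```

With an intercept in the model, the residuals sum to zero (up to floating-point error), which is a quick sanity check on any fit.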

# Chapter 2

- Response value = fitted value + residual
- Regression to the mean : extreme cases don't persist over time, because they are partly due to randomness
- Non-linear data may fit a linear model well after a transformation (e.g. square root or log)
- You may have to transform X, y or both before fitting. If y was transformed, back-transform the predictions (apply the inverse function) to interpret them on the original scale
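The transform, fit, predict, back-transform cycle can be sketched in plain Python as follows. The data is a made-up example where `sqrt(y)` is exactly linear in `x`, and `fit_line` is a hypothetical helper, so this is an illustration under those assumptions and not a general recipe for choosing a transformation.

```python
import math

def fit_line(x, y):
    """Closed-form least squares for one explanatory variable (hypothetical helper)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

# Made-up data: y grows quadratically, so y vs x is curved but sqrt(y) vs x is a line
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [25.0, 49.0, 81.0, 121.0, 169.0, 225.0]

# 1. Transform the response before fitting
sqrt_y = [math.sqrt(yi) for yi in y]

# 2. Fit on the transformed scale
slope, intercept = fit_line(x, sqrt_y)

# 3. Predict on the transformed scale
x_new = 7.0
pred_sqrt_y = slope * x_new + intercept

# 4. Back-transform: the inverse of sqrt is squaring
pred_y = pred_sqrt_y ** 2

print(f"prediction on sqrt scale: {pred_sqrt_y:.2f}, back-transformed: {pred_y:.2f}")
```

If X is transformed as well, apply the same transformation to new X values before predicting; only the response needs the inverse step.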

# Chapter 3

### Quantifying model performance

- Classification:
    - Accuracy can be misleading, e.g. on a class-imbalanced dataset
    - accuracy = (TP + TN)/ (TP + TN + FP + FN)
    - precision = TP / (TP + FP)
    - High precision = few false positives among the predicted positives
    - recall = TP / (TP + FN)
    - High recall = low false negative rate (FNR = 1 - recall)
    - f1 = 2 * (precision * recall) / (precision + recall)
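    - ROC AUC = area under the ROC curve (TP rate vs FP rate across all thresholds)
    - Jaccard score = TP / (TP + FP + FN)
    - Log loss = negative mean log-likelihood of the true labels under the predicted probabilities (lower is better)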
- Regression:
    - r-squared : Proportion of the variability of y explained by the explanatory variables
    - RMSE : Root mean squared error. Typical size of the error, in the units of the target
    - RSE : Residual standard error. Like RMSE, but divides by the degrees of freedom (n minus the number of estimated coefficients, so n - 2 for simple linear regression) instead of n
    - MAE : Mean absolute error. Average absolute error in target
    - Residuals vs fitted plot : Fitted values on X axis vs residuals on Y axis. A good model has residuals scattered evenly above and below zero with no trend (they look like random noise).
    - Q-Q plot : Theoretical quantiles of the normal distribution on X axis vs quantiles of the standardized residuals on Y axis. A good model has points lying close to the straight line y = x.
    - Scale-location plot : Fitted values on X axis vs square root of the absolute standardized residuals on Y axis. A good model shows no trend, i.e. the spread of the residuals is constant.
    - measurement of extreme values (outliers):
        - leverage : measurement of how extreme the explanatory variable values are
        - influence : how much the model would change if you leave the observation out of the dataset when modeling (e.g. Cook's distance)
- Hyperparameter tuning:
    - Hyperparameters are parameters we specify before fitting the model
    - We compare model outcomes across different hyperparameter values
    - Use cross-validation so the chosen values are not overfit to a single train/test split
    - Example : grid search
- Use a boxplot of the cross-validation results to compare the score distributions of different models
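The classification formulas above can be checked by hand from the four confusion-matrix counts. The sketch below uses plain Python and made-up counts for an imbalanced dataset; the function name `classification_metrics` and the choice to return 0.0 when a denominator is zero are assumptions made for this illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute common metrics from confusion-matrix counts.

    Assumption for this sketch: a metric with a zero denominator is reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    jaccard = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "jaccard": jaccard,
    }

# Made-up imbalanced dataset: 1000 samples, only 10 positives.
# Model A predicts "negative" for everything.
imbalanced = classification_metrics(tp=0, tn=990, fp=0, fn=10)

# Model B finds 8 of the 10 positives at the cost of 20 false alarms.
useful = classification_metrics(tp=8, tn=970, fp=20, fn=2)

for name, metrics in [("always negative", imbalanced), ("model B", useful)]:
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Model A has the higher accuracy even though it never finds a positive case, which is why recall, precision and F1 are reported alongside accuracy on imbalanced data.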

```
# Regression performance measurement
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
r2 = r2_score(y_true, y_pred) # R-squared
mse = mean_squared_error(y_true, y_pred) # MSE
rmse = np.sqrt(mse) # RMSE (the squared=False argument is deprecated in recent scikit-learn versions)
mae = mean_absolute_error(y_true, y_pred) # MAE

# Classification performance measurement
from sklearn.metrics import classification_report, confusion_matrix, jaccard_score, log_loss, roc_auc_score, roc_curve, f1_score
print(confusion_matrix(y_test, y_pred)) # Confusion matrix: rows = actual, columns = predicted
print(classification_report(y_test, y_pred)) # Precision, recall, F1 and support per class
print(jaccard_score(y_test, y_pred, pos_label=0)) # Jaccard score, treating class 0 as the positive class
print(log_loss(y_test, y_pred_prob)) # Log loss (needs predicted probabilities)
print(roc_auc_score(y_test, y_pred_prob)) # ROC AUC (needs predicted probabilities)
print(f1_score(y_test, y_pred)) # F1 score

# Visualize ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show()

# Grid-search example for hyperparameter tuning (Ridge is a regression model)
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# 10 evenly spaced alpha values between 0.0001 and 1
param_grid = {"alpha": np.linspace(0.0001, 1, 10), "solver": ["sag", "lsqr"]}
ridge = Ridge()
ridge_cv = GridSearchCV(ridge, param_grid, cv=kf) # Tries every combination in the grid
ridge_cv2 = RandomizedSearchCV(ridge, param_grid, cv=kf, n_iter=2) # Samples n_iter combinations; fit it the same way
ridge_cv.fit(X_train, y_train)
print(ridge_cv.best_params_, ridge_cv.best_score_)

# Compare the score distributions of different models
results = {"Model 1": model1_cv_results, "Model 2": model2_cv_results, "Model 3": model3_cv_results}
# In Matplotlib 3.9+, use tick_labels= instead of labels=
plt.boxplot(list(results.values()), labels=list(results.keys()))
plt.show()

# Leverage: measurement of how extreme the explanatory variable values are (model is a fitted statsmodels OLS result)
leverage = model.get_influence().hat_matrix_diag
# Influence: how much the model would change if you leave the observation out of the dataset when modeling (e.g. Cook's distance)
cooks_distance = model.get_influence().cooks_distance[0]
# Externally studentized residuals help flag outliers in y; on their own they are not an influence measure
studentized_resid = model.get_influence().resid_studentized_external

# Residual plot for regression
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, color='blue', alpha=0.6)
plt.axhline(y=0, color='red', linestyle='--', linewidth=2)
plt.xlabel('Fitted Values (Predicted)')
plt.ylabel('Residuals')
plt.show()

# Q-Q plot for regression
from scipy.stats import probplot
probplot(np.ravel(residuals), dist='norm', plot=plt) # np.ravel works for both NumPy arrays and pandas Series
plt.xlabel('Theoretical Quantiles')
plt.ylabel('Sample Quantiles')
plt.show()

# Scale-location plot (raw residuals are used here; standardized residuals are the usual choice)
plt.scatter(y_pred, np.sqrt(np.abs(residuals)), color='blue', alpha=0.6)
plt.xlabel('Fitted Values (Predicted)')
plt.ylabel('Square Root of Absolute Residuals')
plt.axhline(y=np.mean(np.sqrt(np.abs(residuals))), color='red', linestyle='--', linewidth=2, label='Mean')
plt.legend()
plt.show()
```
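The difference between RMSE and RSE, and what leverage measures, is easier to see with the arithmetic written out. The following is a minimal plain-Python sketch for simple linear regression on made-up data. The leverage formula used, `1/n + (x_i - mean(x))^2 / Sxx`, is the standard closed form for the single-predictor case only; with several predictors the hat matrix diagonal (as in `hat_matrix_diag` above) is needed instead.

```python
import math

# Made-up data: the last point has an extreme x value
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 10.1]
n = len(x)

# Closed-form least-squares fit
x_mean = sum(x) / n
y_mean = sum(y) / n
sxx = sum((xi - x_mean) ** 2 for xi in x)
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_mean - slope * x_mean

residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
ss_res = sum(r ** 2 for r in residuals)

# RMSE divides by n; RSE divides by the degrees of freedom (n - number of coefficients)
n_coefficients = 2  # slope and intercept
rmse = math.sqrt(ss_res / n)
rse = math.sqrt(ss_res / (n - n_coefficients))
mae = sum(abs(r) for r in residuals) / n

# Leverage for simple linear regression: how far each x is from the mean of x
leverage = [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]

print(f"RMSE={rmse:.4f}, RSE={rse:.4f}, MAE={mae:.4f}")
print("leverage:", [round(h, 2) for h in leverage])
```

The leverages always sum to the number of estimated coefficients (2 here), and the point with the most extreme x value gets the largest share.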

# Chapter 4

### Logistic Regression

```
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.formula.api import logit
model = logit("target ~ x_var", data=df).fit()
print(model.params)
# Visualize logistic model
sns.regplot(x="x_var", y="target", data=df, ci=None, logistic=True)
# Predict on a grid of x values
X_test = pd.DataFrame({"x_var": np.arange(-1, 6.25, 0.25)})
y_pred_prob = model.predict(X_test)
y_pred = np.round(y_pred_prob) # Most likely class at a 0.5 threshold
# Odds (often loosely called the odds ratio): p / (1 - p), the probability of something happening over it not happening
odds_ratio = y_pred_prob / (1 - y_pred_prob)
prediction_data = X_test.assign(odds_ratio=odds_ratio)
# Visualize odds / log odds: how likely an occurrence is relative to it not occurring
sns.lineplot(x="x_var", y="odds_ratio", data=prediction_data)
plt.axhline(y=1, linestyle="dotted") # Odds of 1 = both outcomes equally likely
plt.yscale("log") # Alternatively, plot np.log(odds_ratio) on a linear axis to get a straight line
plt.show()
# Confusion matrix on the training data (rows = actual, columns = predicted, 0.5 threshold)
conf_matrix = model.pred_table()
TN = conf_matrix[0,0]
TP = conf_matrix[1,1]
FN = conf_matrix[1,0]
FP = conf_matrix[0,1]
# Visualize confusion matrix
from statsmodels.graphics.mosaicplot import mosaic
mosaic(conf_matrix)
plt.show()

```
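The relationship between probability, odds and log odds in the block above can be sketched without any fitted model. The coefficients below are hypothetical values chosen for illustration, not output from `logit`, and the 0.5 classification threshold is likewise an assumption.

```python
import math

def logistic_probability(x, intercept, slope):
    """Probability of the positive class under a logistic model."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

# Hypothetical coefficients (not from a real fit)
intercept, slope = -3.0, 1.5

rows = []
for x in [0.0, 1.0, 2.0, 3.0, 4.0]:
    p = logistic_probability(x, intercept, slope)
    odds = p / (1 - p)                  # probability of happening over not happening
    log_odds = math.log(odds)           # linear in x: intercept + slope * x
    predicted_class = int(p >= 0.5)     # assumed 0.5 threshold
    rows.append((x, p, odds, log_odds, predicted_class))

for x, p, odds, log_odds, predicted_class in rows:
    print(f"x={x:.1f}  p={p:.3f}  odds={odds:.3f}  log_odds={log_odds:.3f}  class={predicted_class}")

# Ratio of the odds at x + 1 to the odds at x: this is the odds ratio in the strict sense
odds_ratio_per_unit = rows[1][2] / rows[0][2]
print(f"odds ratio per unit of x: {odds_ratio_per_unit:.3f} (exp(slope) = {math.exp(slope):.3f})")
```

Because the log odds are linear in x, each one-unit increase in x multiplies the odds by the same factor, `exp(slope)`. That is why the odds curve becomes a straight line on a log-scaled y axis.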