# Problem Set 2

## POLI 175 - Machine Learning for Political Scientists

In this problem set, we will work with the Civil Conflict dataset. You can find it [here](https://github.com/umbertomig/POLI175public/tree/main/data/mshk-pa-2017).

The file you will use for this problem set is `SambanisImp.csv`. The codebook is in the same folder.

The full paper compares Random Forests with Logistic Regression for predicting civil conflict. You can find it here: [full paper](https://doi.org/10.1093/pan/mpv024).

## 1. Loading Packages (1pt)

Load with `using`:

- DataFrames
- MLJ
- MLJIteration

Load with `import`:

- MLJLinearModels
- MLJBase
- MLJModels
- MultivariateStats
- MLJMultivariateStatsInterface
- CSV
- Plots
- GLM
- StatsBase
- Random
- LaTeXStrings
- StatsPlots
- Lowess
- Gadfly
- RegressionTables
- CovarianceMatrices
- Econometrics
- LinearAlgebra
- MixedModelsExtras
- Missings
- StatsAPI
- FreqTables
- EvalMetrics
- NearestNeighborModels
- NaiveBayes
- MLJNaiveBayesInterface
- Optim

**Hint**: If any of these packages is not installed on your machine, install it first.
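One way to see which packages are still missing is sketched below. This is only an illustration: `missing_packages` is a hypothetical helper (not part of the problem set), and `needed` holds just a few names from the list above. The install line is left commented out so nothing is installed by accident.

```julia
import Pkg

# Hypothetical helper (not part of the problem set): names that cannot be
# located in the active environment.
missing_packages(names) = [n for n in names if Base.find_package(n) === nothing]

# A short, illustrative subset of the list above; extend it as needed.
needed = ["DataFrames", "MLJ", "MLJIteration", "CSV"]
to_install = missing_packages(needed)
println("Packages still to install: ", to_install)

# Uncomment to install whatever is missing:
# Pkg.add(to_install)
```

Note that this only checks the active environment, so run it from the same environment you use for the notebook.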

In [1]:
# Your answers here

After you load the packages correctly, the model instantiations below should run without errors.

In [4]:
# Solver
solver = MLJLinearModels.NewtonCG()

# Linear Regressor
linreg = MLJLinearModels.LinearRegressor();

# Logistic Classifier
logreg = MLJLinearModels.LogisticClassifier(lambda = 0, gamma = 0, solver = solver);

# Linear Discriminant Analysis
lda = MLJMultivariateStatsInterface.LDA();

# 10-NN Classifier
knn = NearestNeighborModels.KNNClassifier(K = 10)

# Naïve Bayes Classifier
nb = MLJNaiveBayesInterface.GaussianNBClassifier();

# Logistic Classifier with Lasso Penalty
# Check https://juliaai.github.io/MLJLinearModels.jl/dev/api/#MLJLinearModels.LogisticRegression
logreglasso = MLJLinearModels.LogisticClassifier(lambda = 0, gamma = 0.5, solver = solver);

# Logistic Classifier with Ridge Penalty
# Check https://juliaai.github.io/MLJLinearModels.jl/dev/api/#MLJLinearModels.LogisticRegression
logregridge = MLJLinearModels.LogisticClassifier(lambda = 0.5, gamma = 0, solver = solver);

## 2. Loading Data (1pt)

Load the dataset and save it in an object called `dat`.
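As a reminder of how CSV parsing into a data frame works, here is a minimal sketch on made-up data. The column names and values in `csv_text` are invented for illustration, and `toy` is a hypothetical object name. For a remote file, one common pattern (an assumption about your setup) is to download it first with `Downloads.download` from the standard library and pass the resulting local path to `CSV.read`.

```julia
import CSV
using DataFrames

# Made-up CSV content, for illustration only.
csv_text = """
country,year,warstds
A,1990,0
B,1991,1
"""

# Parse the text into a DataFrame; a file path works in place of the IOBuffer.
toy = CSV.read(IOBuffer(csv_text), DataFrame)
println(size(toy))
```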

In [5]:
# URL data
urldat = "https://raw.githubusercontent.com/umbertomig/POLI175julia/main/data/mshk-pa-2017/SambanisImp.csv";

In [6]:
# Your answers here

## 3. Subsetting (1pt)

The dataset has many variables, and not all of them are useful here. We are going to use only a few: subset your data to keep just the variables listed below.

Create the `X` (features) and the `y` (targets) sets.
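Selecting columns by name can be sketched on a toy data frame as follows. The names `toy`, `keep`, and `toy_small` are made up for this illustration; the same indexing pattern should carry over to a vector of column names such as the one defined in the next cell.

```julia
using DataFrames

# Toy data frame with three columns.
toy = DataFrame(a = [1, 2], b = [3.0, 4.0], c = ["x", "y"])

# Keep only the columns whose names are listed in `keep`.
keep = ["a", "c"]
toy_small = toy[:, keep]
println(names(toy_small))
```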

In [8]:
## Target variable
target = "warstds";

## Predictors
predictors = [
    "ager","autonomy", "coldwar", "demch98", "dlang", "ef", 
    "expgdp", "fuelexp", "gdpgrowth", "illiteracy", 
    "infant", "lmtnest", "lpopns", "milper", "trade"];

allvars = [predictors; target];

In [9]:
# Your answers here

If you do it correctly, you should be able to unpack as below:

In [11]:
# Unpacking
y, X = unpack(
    dat,
    ==(:warstds);
    :warstds => Multiclass, 
    :ager => Continuous,
    :autonomy => Continuous,
    :coldwar => Multiclass,
    :demch98 => Multiclass,
    :dlang => Continuous,
    :ef => Continuous,
    :fuelexp => Continuous,
    :gdpgrowth => Continuous,
    :illiteracy => Continuous,
    :infant => Continuous,
    :lmtnest => Continuous,
    :lpopns => Continuous,
    :milper => Continuous,
    :trade => Continuous,
);

In [12]:
first(X, 3)

| Row | ager | autonomy | coldwar | demch98 | dlang | ef | expgdp | fuelexp | gdpgrowth | illiteracy | infant | lmtnest | lpopns | milper | trade |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Cat… | Cat… | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 34.4618 | 0.00515082 | 1 | 0 | 70.0 | 0.750797 | 33.5924 | 15.3879 | 0.022562 | 34.0206 | 68.6554 | 4.19871 | 16.0941 | 121.087 | 72.8814 |
| 2 | 34.3463 | 0.0 | 1 | 0 | 70.0 | 0.750797 | 33.5616 | 15.5946 | 0.0224471 | 34.1299 | 68.918 | 4.19871 | 16.1163 | 121.885 | 72.9001 |
| 3 | 77.0 | 0.0 | 1 | 0 | 70.0 | 0.750797 | 33.5771 | 15.6018 | 0.0223715 | 34.1833 | 69.0175 | 4.19871 | 16.1383 | 122.781 | 72.9629 |


## 4. Linear Regression (1pt)

Your target variable is `warstds`, which indicates whether a country is experiencing civil conflict. I saved its name in an object called `target`, and the predictor names in an object called `predictors`. Run a linear regression using MLJ to explain conflict with all the predictors. Then, compute the predicted outcomes and recode them using the following rule:

- 0 if the prediction is less than or equal to 0.5
- 1 if the prediction is greater than 0.5

Make a confusion matrix of your results. Compute the accuracy.

Do you have a good prediction model? Explain.

**Hint 1**: I instantiated the model for you at the top as `linreg`. Save your recoded predictions as `y_pred_linreg`.

**Hint 2:** One way to create a binary variable is:

```julia
y_pred_linreg = predict(my_machine, dataset) .> threshold_number
```

**Hint 3:** If you want an additional metric, the F1-score is a good way to evaluate the model.

**Hint 4:** Use `Vector{Int64}(y)` instead of `y`.
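The recoding rule, the confusion-matrix cells, the accuracy, and the F1-score from the hints above can be sketched in plain Julia on made-up numbers. Everything here (`y_true`, `y_hat`, and the other names) is invented for illustration and is not the output of any model.

```julia
# Made-up true labels and continuous predictions, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
y_hat  = [0.1, 0.2, 0.05, 0.6, 0.3, 0.0, 0.7, 0.4, 0.5, 0.9]

# Recode: 1 if the prediction is above 0.5, otherwise 0.
y_bin = Int.(y_hat .> 0.5)

# Confusion-matrix cells.
tp = sum((y_bin .== 1) .& (y_true .== 1))  # true positives
fp = sum((y_bin .== 1) .& (y_true .== 0))  # false positives
fn = sum((y_bin .== 0) .& (y_true .== 1))  # false negatives
tn = sum((y_bin .== 0) .& (y_true .== 0))  # true negatives

# Accuracy: share of observations classified correctly.
accuracy = (tp + tn) / length(y_true)

# F1-score: harmonic mean of precision and recall.
prec = tp / (tp + fp)
rec  = tp / (tp + fn)
f1   = 2 * prec * rec / (prec + rec)

println((tp = tp, fp = fp, fn = fn, tn = tn, accuracy = accuracy, f1 = f1))
```

Accuracy counts both classes equally, while the F1-score looks only at how well the positive class (conflict) is recovered, which is why the two can tell different stories.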

In [13]:
# Your answers here

## 5. Logistic Regression (1pt)

Fit the same model using a logistic regression. Make a confusion matrix of your results; check the accuracy. Do you have a good prediction model? Explain.

**Important**: I instantiated the model for you as `logreg`. Save the mode prediction as `y_pred_logreg`.
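To see what "mode prediction" means for a binary outcome, here is a small sketch in plain Julia. The `scores` are made up, and `sigmoid` is defined by hand rather than taken from any package. In MLJ, `predict_mode` is the function that returns the most probable class from a probabilistic model.

```julia
# Logistic (sigmoid) function: maps a linear score to a probability in (0, 1).
sigmoid(z) = 1 / (1 + exp(-z))

# Made-up linear scores for four observations.
scores = [-2.0, -0.1, 0.3, 1.5]
p_conflict = sigmoid.(scores)

# The mode is the more probable of the two classes, which for a binary
# outcome amounts to thresholding the probability at 0.5.
y_mode = Int.(p_conflict .> 0.5)
println(y_mode)
```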

In [20]:
# Your answers here

## 6. Linear Discriminant Analysis (1pt)

Fit the same model using a linear discriminant analysis. Make a confusion matrix of your results. Do you have a good prediction model? Explain.

**Important**: I instantiated the model for you as `lda`. Save the mode prediction as `y_pred_lda`.

In [27]:
# Your answers here

## 7. Naïve Bayes Classifier

Fit the same model using the Gaussian Naïve Bayes classifier. Make a confusion matrix and check the accuracy of your results. Do you have a good prediction model? Explain.

**Important**: I instantiated the model for you as `nb`. Save the mode prediction as `y_pred_nb`.

In [34]:
# Your answers here

## 8. K-Nearest Neighbors Classifier (1pt)

Fit the same model using a $10$-Nearest Neighbors classifier.

Make a confusion matrix and check the accuracy of your results. Do you have a good prediction model? Explain.

**Important**: I instantiated the model for you as `knn`. Save the mode prediction as `y_pred_knn`.
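The idea behind the classifier can be sketched by hand on toy data. This is a bare-bones illustration, not how `NearestNeighborModels` implements it: `knn_predict` is a hypothetical helper, the data are made up, and the sketch uses Euclidean distance with $k = 3$ to keep the example small.

```julia
# Toy training data: each row is an observation with two features.
X_train = [1.0 1.0; 1.2 0.8; 0.9 1.1; 3.0 3.2; 3.1 2.9]
y_train = [0, 0, 0, 1, 1]

# Hypothetical helper: classify one point by majority vote among its k
# nearest training points (Euclidean distance).
function knn_predict(X, y, x_new, k)
    dists = [sqrt(sum((X[i, :] .- x_new) .^ 2)) for i in 1:size(X, 1)]
    nearest = sortperm(dists)[1:k]
    votes = y[nearest]
    # Majority vote over binary labels; ties go to class 0.
    return sum(votes) > k / 2 ? 1 : 0
end

pred_a = knn_predict(X_train, y_train, [1.0, 0.9], 3)  # near the class-0 cluster
pred_b = knn_predict(X_train, y_train, [3.0, 3.0], 3)  # near the class-1 cluster
println((pred_a, pred_b))
```

Because the vote depends on distances, features measured on very different scales can dominate the result.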

In [41]:
# Your answers here

## 9. Lasso Logistic Regression (1pt)

Fit a logistic regression applying a Lasso penalty.

Make a confusion matrix and check the accuracy of your results. Do you have a good prediction model? Explain.

**Important**: I instantiated the model for you as `logreglasso`. Save the mode prediction as `y_pred_logreglasso`.

In [48]:
# Your answers here

## 10. Ridge Logistic Regression (1pt)

Fit a logistic regression applying a Ridge penalty.

Make a confusion matrix and check the accuracy of your results. Do you have a good prediction model? Explain.

**Important**: I instantiated the model for you as `logregridge`. Save the mode prediction as `y_pred_logregridge`.
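For reference, the two penalties can be written as terms added to the logistic loss $L(\theta)$. The expression below is a sketch under the assumption that `lambda` scales the Ridge (L2) term and `gamma` scales the Lasso (L1) term, which is how the instantiations at the top read. The exact scaling convention, and the `penalty` option that determines which terms are active, are described in the documentation linked there.

$$
\hat{\theta} = \arg\min_{\theta} \; L(\theta) + \frac{\lambda}{2} \lVert \theta \rVert_2^2 + \gamma \lVert \theta \rVert_1
$$

Under this reading, Lasso corresponds to $\lambda = 0, \gamma > 0$ and Ridge to $\lambda > 0, \gamma = 0$. The L1 term can shrink some coefficients exactly to zero, while the L2 term shrinks all of them smoothly toward zero.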

In [55]:
# Your answers here

**Look back to your results**.

Answer for yourself (or for us, if you like :-)):

*Are any of these models good? Why?*

***

***Your comments here***

***

**Great work!**