## Machine Learning
***
#### Problem Set 1
#### Varvara Ilyina
#### 2024-02-28

The following exercises were solved with the help of Chapters 1, 4, and 5 of _Probabilistic Machine Learning: An Introduction_ by Kevin Murphy, as well as some explanatory YouTube videos. Additionally, various internet resources and ChatGPT were used to help me write the correct commands for my Python code.

***

##### 1. The particular task we will be considering is predicting `vote` based on five features: `TVnews`, `PID`, `age`, `educ`, and `popul`. Calculate summary statistics for the label and five features described above. Pay attention to the meaning of each variable and present a summary of it that makes sense given how it is coded.

In [11]:
# load packages
import statsmodels.api as sm
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# load dataset
data = sm.datasets.anes96.load_pandas().data

# extract necessary columns
features = ['TVnews', 'PID', 'age', 'educ', 'popul']
label = 'vote'
df = data[features + [label]]

df.shape

(944, 6)

In [13]:
# split data into training and testing sets
X = df[features]
y = df[label]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

# train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# make predictions on the test set
y_pred = model.predict(X_test)

# evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Mean Squared Error: 0.10158756592995671


In [14]:
# print summary statistics
summary = df.describe()
print(summary)

           TVnews         PID         age        educ        popul        vote
count  944.000000  944.000000  944.000000  944.000000   944.000000  944.000000
mean     3.727754    2.842161   47.043432    4.565678   306.381356    0.416314
std      2.677235    2.273337   16.423130    1.599287  1082.606745    0.493208
min      0.000000    0.000000   19.000000    1.000000     0.000000    0.000000
25%      1.000000    1.000000   34.000000    3.000000     1.000000    0.000000
50%      3.000000    2.000000   44.000000    4.000000    22.000000    0.000000
75%      7.000000    5.000000   58.000000    6.000000   110.000000    1.000000
max      7.000000    6.000000   91.000000    7.000000  7300.000000    1.000000
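
`describe()` treats every column as continuous, which is mainly informative for `age`. The sketch below shows one way to summarize each variable according to its coding instead. It assumes, following the statsmodels `anes96` description, that `vote` is binary, `PID` and `educ` are ordinal codes, `TVnews` counts days per week, and `popul` is population in thousands. The helper `summarize_by_coding` and the small stand-in frame `toy` are hypothetical and only keep the block runnable on its own.

```python
import pandas as pd

def summarize_by_coding(df, categorical, continuous):
    """Relative frequencies for coded columns, quantiles for continuous ones."""
    freqs = {
        col: df[col].value_counts(normalize=True).sort_index()
        for col in categorical
    }
    quantiles = df[continuous].quantile([0.0, 0.25, 0.5, 0.75, 1.0])
    return freqs, quantiles

# hypothetical stand-in rows; in the notebook, pass the `df` built above instead
toy = pd.DataFrame({
    "vote": [0, 1, 0, 0],
    "PID": [0, 6, 2, 2],
    "educ": [3, 7, 4, 4],
    "TVnews": [7, 0, 3, 7],
    "age": [34, 58, 44, 19],
    "popul": [0, 110, 22, 7300],
})
freqs, quantiles = summarize_by_coding(
    toy,
    categorical=["vote", "PID", "educ", "TVnews"],
    continuous=["age", "popul"],
)
print(freqs["vote"])   # share of each vote code
print(quantiles)       # quantiles are more robust than the mean for skewed `popul`
```

Reporting quantiles for `popul` is a deliberate choice: its mean (about 306) sits far above its median (22) in the table above, so the mean alone is misleading.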


*

##### 2. What is the formula for the closed form estimate of the coefficient vector in ordinary least squares regression? Estimate the coefficients using numpy in Python by performing the matrix operations from the closed form solution we worked out in class.

*

__Solution:__ \
``
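
The closed-form OLS estimate is $\hat{\beta} = (X^\top X)^{-1} X^\top y$, where $X$ includes a column of ones for the intercept. A minimal numpy sketch follows. The function name `ols_closed_form` and the synthetic data are hypothetical, and the sketch solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, which gives the same estimate.

```python
import numpy as np

def ols_closed_form(X_raw, y):
    """Solve the normal equations (X'X) beta = X'y after prepending an intercept column."""
    X = np.column_stack([np.ones(len(X_raw)), X_raw])
    return np.linalg.solve(X.T @ X, X.T @ y)

# hypothetical noise-free data generated as y = 1 + 2*x1 - 3*x2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.0 + X_demo @ np.array([2.0, -3.0])

beta_ols = ols_closed_form(X_demo, y_demo)
print(beta_ols)  # intercept first, then one coefficient per feature
```

In the notebook, the same function could be called with `df[features].to_numpy()` and `df[label].to_numpy()`.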

##### 3. Estimate the coefficients using the statsmodels package (sm.OLS documentation). Compare them.

*

__Solution:__ \
``

##### 4. Now, think about the model you just estimated. In class, we talked about two assumptions we could use to motivate estimation of the variance of this coefficient vector. Which would you choose and why?

*

__Solution:__ \
``

##### 5. Estimate the variance of these coefficients using the matrix formula.

*
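
The source does not state which two assumptions were covered in class. The sketch below assumes they are homoskedasticity, giving $\widehat{Var}(\hat{\beta}) = \hat{\sigma}^2 (X^\top X)^{-1}$ with $\hat{\sigma}^2 = \frac{1}{n-p}\sum_i e_i^2$, and heteroskedasticity, giving the sandwich (HC0) form $(X^\top X)^{-1} X^\top \mathrm{diag}(e_i^2) X (X^\top X)^{-1}$. The function name `ols_variance` and the synthetic data are hypothetical.

```python
import numpy as np

def ols_variance(X, y, robust=False):
    """Return (beta_hat, Var_hat(beta_hat)); X must already contain an intercept column."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta_hat = xtx_inv @ X.T @ y
    resid = y - X @ beta_hat
    if robust:
        # sandwich (HC0): (X'X)^-1 X' diag(e^2) X (X'X)^-1
        meat = X.T @ (X * resid[:, None] ** 2)
        return beta_hat, xtx_inv @ meat @ xtx_inv
    # homoskedastic: sigma^2 (X'X)^-1 with sigma^2 = RSS / (n - p)
    sigma2 = resid @ resid / (n - p)
    return beta_hat, sigma2 * xtx_inv

# hypothetical homoskedastic data, so both estimators should be close
rng = np.random.default_rng(2)
X_demo = np.column_stack([np.ones(300), rng.normal(size=(300, 2))])
y_demo = X_demo @ np.array([0.4, 0.1, -0.2]) + rng.normal(scale=0.5, size=300)

beta_hat, var_classic = ols_variance(X_demo, y_demo)
_, var_robust = ols_variance(X_demo, y_demo, robust=True)
print(np.sqrt(np.diag(var_classic)))  # standard errors under homoskedasticity
print(np.sqrt(np.diag(var_robust)))   # heteroskedasticity-robust standard errors
```

With a binary label such as `vote`, the residual variance of a linear model depends on the fitted value, so the two sets of standard errors may differ more on the real data than they do here.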

##### 6. Create a table showing, for each feature $j$, the estimate ($\hat{\beta}_j$), the standard error ($\sqrt{\widehat{Var}(\hat{\beta})_{jj}}$), and the upper and lower bounds of the 95% confidence interval ($\hat{\beta}_j \pm z_{\alpha}\sqrt{\widehat{Var}(\hat{\beta})_{jj}}$). Compare the variance to what you got from statsmodels. What assumption are they using on the variance?

*
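
A possible layout for the table is sketched below, using the homoskedastic variance $\hat{\sigma}^2 (X^\top X)^{-1}$ and the normal critical value $z = 1.96$ for a 95% interval. The helper `coef_table`, the column names, and the synthetic data are hypothetical. For the comparison, statsmodels' `OLS.fit()` defaults to `cov_type='nonrobust'`, which is also the homoskedastic formula, although its reported intervals use the $t$ distribution rather than $z$.

```python
import numpy as np
import pandas as pd

def coef_table(X, y, names, z=1.96):
    """OLS estimates, homoskedastic standard errors, and normal-approximation CIs."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta_hat = xtx_inv @ X.T @ y
    resid = y - X @ beta_hat
    # standard error of coefficient j is the square root of the j-th diagonal entry
    se = np.sqrt(np.diag(resid @ resid / (n - p) * xtx_inv))
    return pd.DataFrame(
        {
            "estimate": beta_hat,
            "std_err": se,
            "ci_lower": beta_hat - z * se,
            "ci_upper": beta_hat + z * se,
        },
        index=names,
    )

# hypothetical data with an intercept column already included
rng = np.random.default_rng(3)
X_demo = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y_demo = X_demo @ np.array([0.4, 0.1, -0.2]) + rng.normal(scale=0.5, size=200)

table = coef_table(X_demo, y_demo, names=["const", "x1", "x2"])
print(table.round(3))
```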

##### 7. Write a function with three arguments:
* `beta`: A 1D numpy array representing a particular value of your coefficients, $β$.
* `label`: A 1D numpy array of the labels in your dataset.
* `features`: A 2D numpy array representing the features in your dataset.
##### This function should output a single number, the negative log-likelihood evaluated at the chosen value of $β$.
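
A minimal sketch of such a function is given below, assuming `features` already contains an intercept column and `label` is coded 0/1. With $z_i = x_i^\top \beta$, the negative Bernoulli log-likelihood $-\sum_i [y_i \log \sigma(z_i) + (1 - y_i) \log(1 - \sigma(z_i))]$ simplifies to $\sum_i [\log(1 + e^{z_i}) - y_i z_i]$, which `np.logaddexp` evaluates without overflow. The demo arrays are hypothetical.

```python
import numpy as np

def nll(beta, label, features):
    """Negative Bernoulli log-likelihood of a logistic regression at coefficients beta."""
    z = features @ beta
    # log(1 + e^z) - y*z, with log(1 + e^z) computed stably as logaddexp(0, z)
    return np.sum(np.logaddexp(0.0, z) - label * z)

# hypothetical data: intercept column plus one feature
X_demo = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y_demo = np.array([1.0, 0.0, 1.0])

# at beta = 0 every predicted probability is 0.5, so the value is n * log(2)
print(nll(np.zeros(2), y_demo, X_demo))
```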

##### 8. Using the `SciPy` library, minimize the objective function we discussed in class for logistic regression.

In [None]:
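import scipy.optimize

# minimize the negative log-likelihood from Question 7, nll(beta, label, features);
# X is assumed to hold an intercept column plus the five features (6 coefficients)
opt_result = scipy.optimize.minimize(
    nll, x0=[0.0] * 6, args=(y, X), method='BFGS'
)
beta_logistic = opt_result.x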

##### 9. Now you can construct your predictions by taking the dot product between `beta_logistic` and your feature matrix and then passing that dot product through the sigmoid function. This provides an estimate of the probability of class membership. Also calculate the most likely class for each unit by predicting a 1 when $p(y_i \mid x_i; \beta) > 0.5$ (i.e. the Heaviside function).
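
The two steps can be sketched as follows, assuming the feature matrix already contains an intercept column. The helper `predict_logistic` and the demo values are hypothetical; `scipy.special.expit` is the sigmoid function.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def predict_logistic(beta, features, threshold=0.5):
    """Return predicted probabilities and the thresholded class labels."""
    prob = expit(features @ beta)
    return prob, (prob > threshold).astype(int)

# hypothetical design matrix (intercept + one feature) and coefficients
X_demo = np.array([[1.0, -2.0], [1.0, 0.0], [1.0, 3.0]])
beta_demo = np.array([0.0, 1.0])

prob, pred_class = predict_logistic(beta_demo, X_demo)
print(prob)
print(pred_class)
```

Because the rule uses a strict inequality, a unit with probability exactly 0.5 is assigned to class 0.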

##### 10. Construct class estimates for your OLS predictions as well by calculating $1(X\beta_{ols} > 0.5)$ (i.e. output a $1$ if the OLS predicted value is greater than $0.5$).

##### 11. Calculate the full confusion matrix for the logistic regression and the OLS model.
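
A small numpy sketch of the calculation is shown below, with rows indexing the true class and columns the predicted class. The helper `confusion_matrix_2x2` and the score arrays are hypothetical; `sklearn.metrics.confusion_matrix` uses the same row/column convention and could be used instead.

```python
import numpy as np

def confusion_matrix_2x2(y_true, y_pred):
    """Rows are true classes (0, 1); columns are predicted classes (0, 1)."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    cm = np.zeros((2, 2), dtype=int)
    # increment cell (true, predicted) once per observation
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# hypothetical labels and model outputs
y_true = np.array([0, 0, 1, 1, 1])
ols_scores = np.array([0.2, 0.7, 0.4, 0.9, 1.3])   # OLS predictions can fall outside [0, 1]
logit_prob = np.array([0.1, 0.4, 0.6, 0.8, 0.95])

# both models are thresholded at 0.5 before counting
cm_ols = confusion_matrix_2x2(y_true, ols_scores > 0.5)
cm_logit = confusion_matrix_2x2(y_true, logit_prob > 0.5)
print(cm_ols)
print(cm_logit)
```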

##### 12. Plot the relationship between the predictions from the linear regression in Question 1 (on the x-axis) and the predictions from the logistic regression (on the y-axis). What do you see?

##### 13. Separating users into communities based on the kind of stories they engage with and other users they interact with.

##### 14. Predicting whether users will click on a story or not based on their past behavior.

##### 15. Choosing which story to show a user in order to keep them active on the platform for longer.