# IDS 575: Assignment 2

- Turn in solutions as a single notebook (ipynb) _and_ as a pdf on Blackboard. No need to turn in datasets/word-docs.
- Answer the following questions concisely, in complete sentences, and with full clarity. Collaboration across groups is _strictly_ not allowed. Always cite all your sources.
- Make appropriate assumptions necessary in order to answer the questions, and mention them explicitly at the beginning of each answer.

### 1. Gradient Descent (6pt)
1. Implement the Batch Gradient Descent (BGD) procedure. 
  - It should take as inputs:
    - the number of epochs (an integer)
    - the starting/initial point (a numpy array)
    - the learning rate (a real value), and
    - the gradient function (a function that takes a point as input and produces the gradient of the loss with respect to the point as output), and its necessary inputs.
  - It should return:
    - all iterates (as a list/numpy array),
    - the final iterate (as a numpy array), and
    - the absolute and relative changes corresponding to the final iterate (make suitable assumptions if needed).
2. Write the gradient function (G) corresponding to a linear model and the squared loss.
  - It should take as inputs:
    - a regression dataset (a pandas dataframe), and
    - the point (a numpy array) at which the gradient needs to be evaluated
  - It should return:
    - the gradient of the squared loss with respect to the input point.
3. Consider a simple regression setting. 
  1. Generate a regression dataset of appropriate size using the `make_regression` function from scikit-learn. It should have one feature and one target variable.
  2. Evaluate BGD when 
    - the number of epochs is 1000,
    - the starting point is numpy.random.randn(2),
    - the learning rate is 0.01, and
    - the gradient function is G (which uses the dataset generated in the previous step)
  3. Plot the magnitude of the gradients as a function of epoch (e.g., using matplotlib). Is there a pattern to the magnitudes?
  4. Plot the iterates on the 2D plane. Do they visually indicate convergence?
  5. Plot the loss function with respect to the slope and intercept parameters. Is the function convex?
  6. Vary the number of epochs and the learning rate, then redo steps 3-5 above. Compare and contrast your results with those obtained under the settings specified in step 2.
4. Implement the Stochastic Gradient Descent (SGD) procedure. Redo steps 2-6 of part 3 and comment on how this procedure may yield different results compared to BGD.
5. Implement the Mini-batch Gradient Descent (MBGD) procedure where the batch size is an input. Redo steps 2-6 of part 3 for two different batch sizes, and comment on how this procedure may yield different results compared to SGD and BGD.
6. Modify the loss function to be $$(h_{\theta}(x)-y)^4$$ and repeat steps 2-6 of part 3.
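
One possible shape for the interfaces described in this question is sketched below. It is a minimal illustration under stated assumptions, not a reference solution: the names `bgd`, `squared_loss_gradient`, and `minibatch_gd`, the column names `x` and `y`, the mean-scaled loss, and the definition of "change" as the distance between the last two iterates are all choices made here for illustration. The data is a synthetic NumPy stand-in for the output of `make_regression`.

```python
import numpy as np
import pandas as pd


def bgd(n_epochs, start, lr, grad_fn, *grad_args):
    """Batch gradient descent: one full-gradient step per epoch."""
    theta = np.asarray(start, dtype=float).copy()
    iterates = [theta.copy()]
    for _ in range(n_epochs):
        theta = theta - lr * grad_fn(theta, *grad_args)
        iterates.append(theta.copy())
    iterates = np.array(iterates)
    # Assumption: "change" is measured between the last two iterates.
    previous = iterates[-2] if len(iterates) > 1 else iterates[-1]
    abs_change = np.linalg.norm(iterates[-1] - previous)
    rel_change = abs_change / max(np.linalg.norm(previous), 1e-12)
    return iterates, iterates[-1], abs_change, rel_change


def squared_loss_gradient(theta, df):
    """Gradient of (1 / 2n) * sum((theta[0] + theta[1] * x - y) ** 2)."""
    x, y = df["x"].to_numpy(), df["y"].to_numpy()
    residual = theta[0] + theta[1] * x - y
    return np.array([residual.mean(), (residual * x).mean()])


def minibatch_gd(n_epochs, start, lr, grad_fn, df, batch_size, seed=0):
    """Mini-batch descent; batch_size=1 behaves like SGD, len(df) like BGD."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(start, dtype=float).copy()
    iterates = [theta.copy()]
    for _ in range(n_epochs):
        order = rng.permutation(len(df))  # reshuffle the rows every epoch
        for lo in range(0, len(df), batch_size):
            batch = df.iloc[order[lo:lo + batch_size]]
            theta = theta - lr * grad_fn(theta, batch)
        iterates.append(theta.copy())  # record one iterate per epoch
    return np.array(iterates), theta


# Synthetic one-feature dataset (a stand-in for make_regression output).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({"x": x, "y": 1.5 + 3.0 * x + rng.normal(scale=0.1, size=200)})

start = rng.normal(size=2)
iterates, theta_bgd, abs_change, rel_change = bgd(
    1000, start, 0.01, squared_loss_gradient, df
)
grad_norms = [np.linalg.norm(squared_loss_gradient(t, df)) for t in iterates]
_, theta_mb = minibatch_gd(100, start, 0.01, squared_loss_gradient, df, batch_size=20)
```

Here `grad_norms` and `iterates` hold the per-epoch quantities that the plotting steps ask about, and `theta[0]` is treated as the intercept with `theta[1]` as the slope.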

### 2. Classification using Regression (5pt)

1. Download the [Adult dataset](https://drive.google.com/file/d/1UJ45CQBg0wJh0KbxBzqTh9s798KtfxLO/view?usp=sharing) and read it into a dataframe using pandas.
  - Display the top few rows of the dataframe using `head()`.
  - Separate the variables into those that are categorical and those that are numeric and display them. 
  - For the variables that are categorical, report the unique values and their counts. 
  - For each feature, also plot the normalized distribution of occurrence of the values using a bar plot, after sorting them according to their occurrence.
2. Our primary aim is to predict the `class` column using the other attributes. That is, we want to predict whether a person earns over 50K a year from heterogeneous data such as age, employment, education, family information, etc. This is a classification problem, and our objective in this question is to solve it using linear regression.
  - Describe how we can build a classifier using a linear model.
  - Is the dataset balanced? Why or why not?
  - Write a function from scratch to split the dataset 80:20 into training and validation sets. Report the number of observations obtained in each set.
3. Perform basic exploratory analysis by computing and visualizing correlations between features.
    - Draw the plot of correlations between every pair of features. 
    - Which features are highly correlated with one another?
    - Is the data matrix tall or wide?
4. Perform linear regression using the above BGD (or SGD or MBGD, your choice) procedure.
    - What is the output variable for regression? Is it the same as the `class` column? If not, how is it related to it?
    - Is the output variable predictable from the features in our dataset?
    - Plot the histogram of residuals using pandas directly (e.g., `df.hist(bins=...)`). What number of bins was useful for your visualization? 
    - Are the residuals normally distributed? What are their mean and variance?
    - Compare the solution obtained with the one that can be derived using normal equations (compute this using numpy). That is, 
    $$\theta_{analytical} = (X^{T}X)^{-1}X^{T}y.$$
    - Computationally show that $X^{T}X$ is positive definite by computing all its eigenvalues and showing that they are positive and real.
5. Given the solution of linear regression, predict the `class` column and report the misclassification error on the validation data. Compare the performance of this classifier against the decision tree classifier that you implemented in Assignment 1.
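
The numerical steps in this question (the from-scratch 80:20 split, the normal equations, the eigenvalue check, and turning a regression output into a class prediction) can be sketched as follows. This is an illustrative sketch only: it runs on synthetic data rather than the Adult dataset, the helper name `train_val_split_indices` is hypothetical, and encoding the class as 0/1 with a 0.5 decision threshold is one assumption among several reasonable ones.

```python
import numpy as np


def train_val_split_indices(n_rows, train_frac=0.8, seed=0):
    """Shuffle row positions and cut them train_frac : (1 - train_frac)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_train = int(round(train_frac * n_rows))
    return order[:n_train], order[n_train:]


# Synthetic stand-in for numerically encoded features (hypothetical data).
rng = np.random.default_rng(0)
n = 1000
features = rng.normal(size=(n, 3))
score = features @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=n)
labels = (score > 0).astype(float)  # assumption: class encoded as 0/1

X = np.column_stack([np.ones(n), features])  # prepend an intercept column
train_idx, val_idx = train_val_split_indices(n)
X_train, y_train = X[train_idx], labels[train_idx]

# Normal equations, solved without forming the explicit inverse.
XtX = X_train.T @ X_train
theta = np.linalg.solve(XtX, X_train.T @ y_train)

# X^T X is symmetric, so eigvalsh returns real eigenvalues.
eigenvalues = np.linalg.eigvalsh(XtX)

# Classify by thresholding the regression output at 0.5.
predictions = (X[val_idx] @ theta >= 0.5).astype(float)
error_rate = np.mean(predictions != labels[val_idx])
```

The index arrays returned by the split can be passed to `df.iloc[...]` when the data lives in a pandas dataframe. Using `np.linalg.solve` instead of an explicit inverse is a numerical-stability choice; it solves the same linear system as the formula above.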

### 3. Likelihood Maximization (1pt)

Consider the function $\theta^3(1-\theta)^7$.  Let $\alpha$ be such that $\theta = 1/(1+e^{-\alpha})$. 
 - Plot the function when $\alpha$ is between -6 and 6. 
 - Maximize the function over $\alpha$ using SGD that was implemented in Q1.
 - Maximize the log of the function over $\alpha$ using SGD. Are the solutions obtained the same? Why or why not?
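
The two objectives and their gradients with respect to $\alpha$ can be written down as in the sketch below. It is a minimal illustration, assuming plain deterministic gradient ascent in place of the SGD routine from Q1 (there is no dataset to sample from here); the function names, the starting point, the step size, and the number of steps are all arbitrary choices made for this example.

```python
import numpy as np


def sigmoid(alpha):
    return 1.0 / (1.0 + np.exp(-alpha))


def likelihood(alpha):
    theta = sigmoid(alpha)
    return theta**3 * (1.0 - theta) ** 7


def grad_log_likelihood(alpha):
    # d/d(alpha) [3 log(theta) + 7 log(1 - theta)] = 3 (1 - theta) - 7 theta,
    # using d(theta)/d(alpha) = theta * (1 - theta).
    return 3.0 - 10.0 * sigmoid(alpha)


def grad_likelihood(alpha):
    # Chain rule: f'(alpha) = f(alpha) * d/d(alpha) log f(alpha)
    return likelihood(alpha) * grad_log_likelihood(alpha)


def gradient_ascent(grad_fn, alpha0, lr, n_steps):
    """Move alpha uphill along grad_fn for a fixed number of steps."""
    alpha = float(alpha0)
    for _ in range(n_steps):
        alpha += lr * grad_fn(alpha)
    return alpha


alphas = np.linspace(-6.0, 6.0, 241)  # grid for plotting the function
values = likelihood(alphas)

alpha_raw = gradient_ascent(grad_likelihood, 0.0, lr=0.1, n_steps=200)
alpha_log = gradient_ascent(grad_log_likelihood, 0.0, lr=0.1, n_steps=200)
```

The arrays `alphas` and `values` can be passed directly to a plotting call such as `matplotlib.pyplot.plot`, and comparing `alpha_raw` with `alpha_log` for different step sizes and step counts is one way to explore the last question.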