### Activity 1

#### 1. What is the difference between a normal distribution and a standard normal distribution?

A standard normal distribution is a normal distribution with the mean equal to 0 and the standard deviation equal to 1.
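As a minimal sketch of this relationship, any normal variable can be converted to a standard normal one by subtracting its mean and dividing by its standard deviation. The values of `mu` and `sigma` below are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration
mu, sigma = 50.0, 10.0

rng = np.random.default_rng(seed=0)
x = rng.normal(loc=mu, scale=sigma, size=100_000)  # samples from a normal distribution

# Standardize: subtract the mean, then divide by the standard deviation
z = (x - mu) / sigma

print(f"x: mean={x.mean():.2f}, std={x.std():.2f}")
print(f"z: mean={z.mean():.2f}, std={z.std():.2f}")
```

The standardized samples `z` should have a mean close to 0 and a standard deviation close to 1.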

#### 2. Go through the documentation on uniform distribution.

In [None]:
# Import relevant modules / libraries
import pandas as pd
import numpy as np
from scipy.stats import uniform
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create the drawing space
fig, ax = plt.subplots(1, 1)

# Compute the mean, variance, skew, and kurtosis for the standard uniform distribution:
mean, var, skew, kurt = uniform.stats(moments='mvsk')

# PPF = Percent Point Function - i.e. percentile
# np.linspace is used to "return evenly spaced numbers over a specified interval".

# Combining ppf with linspace gets the x-axis numbers. For the standard uniform
# distribution, ppf(q) = q, so x = np.linspace(0.01, 0.99, 100) gives the same values.
x = np.linspace(uniform.ppf(0.01), uniform.ppf(0.99), 100)

# PDF = Probability Density Function. uniform.pdf(x) evaluates the density of the
# standard uniform distribution at each value in x (it does not generate random values)

# ax.plot keywords: lw = linewidth, alpha = transparency

# We're going to plot the probability density function
ax.plot(x, uniform.pdf(x), 'r-', lw=5, alpha=0.6, label='uniform pdf')

# Freeze the distribution and display the frozen pdf:
rv = uniform()
ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')

# np.allclose "returns True if two arrays are element-wise equal within a tolerance."
# CDF = Cumulative Distribution Function

# "Check accuracy of cdf and ppf:"
vals = uniform.ppf([0.001, 0.5, 0.999])
print(np.allclose([0.001, 0.5, 0.999], uniform.cdf(vals)))

# RVS = Random VariateS
r = uniform.rvs(size=1000)

# Plot the distribution
ax.hist(r, density=True, histtype='stepfilled', alpha=0.2)
ax.legend(loc='best', frameon=False)
plt.show()

#### 3. Discuss the use of uniform distribution in generating random numbers.

TBA
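One well-known use of the uniform distribution in random number generation is inverse transform sampling: standard uniform numbers are passed through the inverse CDF of a target distribution to obtain samples from that distribution. The sketch below assumes an exponential target with a hypothetical rate `lam`; it is an illustration, not the only way to generate such samples.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Step 1: draw standard uniform numbers on [0, 1)
u = rng.uniform(size=100_000)

# Step 2: push them through the inverse CDF of the target distribution.
# For an exponential distribution with rate lam, the inverse CDF is
# F^{-1}(u) = -ln(1 - u) / lam.
lam = 2.0  # hypothetical rate, chosen only for illustration
samples = -np.log(1.0 - u) / lam

print(f"sample mean = {samples.mean():.3f} (theoretical mean = {1 / lam:.3f})")
```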

#### 4. Plot a uniform distribution.

In [None]:
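# Draw 1000 samples from the standard uniform distribution on [0, 1)
# (np.random.randn would draw from the standard normal distribution instead)
distribution = np.random.rand(1000)

fig, ax = plt.subplots(figsize=(9, 12))
ax.hist(distribution)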
plt.show()

### Activity 2

#### 1. What kinds of problems can we solve with supervised learning? List some examples with the factors and the target.


> Supervised learning problems, where an output attribute is present in the dataset and the target of the analysis is to derive a model that describes the relationship between this output attribute and other input attributes in the dataset.

There are 2 different types of supervised learning: *regression* & *classification*.

Regression problem examples:

* Finding the highest margin donors / customers from a non-profit / business database  
    * *Factors:* gender, state, income, political view 
    * *Target:* total donation / total profit
- Forecasting future stock prices (or temperatures, revenue)
    - *Factors:* political events, press releases, selling / buying of stocks in the previous months
    - *Target:* stock price

*Algorithms:* linear regression and polynomial regression (logistic regression, despite its name, is a classification algorithm)

Classification problem examples:
* Classify spam email from non-spam email
    * *Factors:* email addresses, subject lines, number of typos in the email body, key phrases in the email (e.g. requests for credit card details), the presence of non-regular attachments (e.g. *.exe*, *.apk* files)
    * *Target:* spam or not spam
- Finding out what customers would buy based on a previous profile
    - *Factors:* delivery address, other visited websites, other products they viewed, total basket amount, previously bought items, average price of products
    - *Target:* creating "customer types" (to later target advertisement to)
* Whether car sales will increase, decrease or remain stable (3 possible outcomes) over the next 12 months.
    * *Factors:* number of cars sold in the past months, car prices
    * *Target:* car sales trend

*Algorithms:* linear classifiers (e.g. logistic regression), support vector machines, decision trees, random forests

#### 2. What kinds of problems can we solve with unsupervised learning? List some examples with the factors and the target.

Unsupervised learning allows machines to discover hidden patterns in data, which helps in *clustering*, *association*, and *dimensionality reduction* problems. The goal of unsupervised learning is to offer undiscovered insights about data.

Clustering problem examples:

* Market segmentation
* Image compression
* News sections

*Algorithms:* exclusive (K-means clustering), overlapping (fuzzy K-means clustering), hierarchical (Ward's, average, minimum, and maximum linkage), and probabilistic (Gaussian mixture models).
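To make the clustering idea concrete, here is a minimal sketch of K-means (Lloyd's algorithm) written with NumPy only. The `kmeans` helper and the two synthetic blobs are hypothetical and exist only for this illustration; a library implementation would normally be used in practice.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Basic K-means (Lloyd's algorithm); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points picked at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Synthetic data: two well-separated blobs (no labels are given to the algorithm)
rng = np.random.default_rng(seed=1)
blob_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=1.0, size=(100, 2))
points = np.vstack([blob_a, blob_b])

centroids, labels = kmeans(points, k=2)
print(np.round(centroids, 2))
```

Each point ends up in exactly one cluster, which is why K-means is listed as an exclusive clustering method.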

Association problem examples:

* Market basket analysis
* Recommendation engines, e.g. Spotify, Amazon

*Algorithms:* Apriori, Eclat, FP-Growth

Dimensionality reduction problem examples:

* Reducing the number of features in a dataset (to prevent overfitting, improve data visualization)
* Removing noise from a picture to improve quality
* Anomaly detection 

*Algorithms:* principal component analysis, singular value decomposition, autoencoders (which leverage neural networks)
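As a rough sketch of dimensionality reduction, principal component analysis can be carried out with a singular value decomposition of the centered data. The synthetic dataset below is an assumption made for illustration: its third feature is almost a linear combination of the first two, so two components should capture nearly all of the variance.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data: 200 samples, 3 features, where the third feature is
# almost a linear combination of the first two (so it is largely redundant)
base = rng.normal(size=(200, 2))
third = base @ np.array([0.5, -1.0]) + 0.01 * rng.normal(size=200)
data = np.column_stack([base, third])

# PCA: center the data, then take the SVD
centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Fraction of the total variance captured by each principal component
explained = S**2 / np.sum(S**2)

# Keep two components: project onto the top two right-singular vectors
reduced = centered @ Vt[:2].T

print("explained variance ratio:", np.round(explained, 4))
print("reduced shape:", reduced.shape)
```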

### Activity 3

#### Read the Medium article. List some of the keywords that you identified in the article. Do not worry about understanding what the terms mean right now. We will discuss linear regression in greater detail as we go on.

A linear regression model fits a straight line to the target data given one or more features. To find the gradient (slope) and intercept of the line, the algorithm calculates the mean squared error (MSE) between the line's predictions and the target values, and iterates until it reaches the minimum possible MSE for the given features and target data.
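The iterative procedure described above can be sketched with plain gradient descent on the MSE. This is one possible approach, not necessarily the one used in the article; the data, learning rate, and iteration count are assumptions chosen for illustration. (Ordinary least squares also has a closed-form solution, which `np.polyfit` is used to check against here.)

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data from a known line (hypothetical values): y = 3x + 2 + noise
x = rng.uniform(0, 1, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=200)

slope, intercept = 0.0, 0.0
learning_rate = 0.5  # assumed step size; too large a value would diverge

for _ in range(2000):
    error = (slope * x + intercept) - y
    # Gradients of MSE = mean(error^2) with respect to slope and intercept
    grad_slope = 2.0 * np.mean(error * x)
    grad_intercept = 2.0 * np.mean(error)
    # Step downhill on the MSE surface
    slope -= learning_rate * grad_slope
    intercept -= learning_rate * grad_intercept

mse = np.mean((slope * x + intercept - y) ** 2)
print(f"gradient descent: slope={slope:.3f}, intercept={intercept:.3f}, MSE={mse:.4f}")

# Closed-form least squares fit for comparison
print("np.polyfit:      ", np.round(np.polyfit(x, y, 1), 3))
```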

### Activity 4

#### Display the OLS summary using the statsmodels library and research the different statistics in the table. What does each element in the table mean? (There is a summary table in the output when you use linear regression with statsmodels.)

In [None]:
# Generate dataframe with random numbers for simple linear regression model
df = pd.DataFrame(100 * np.random.rand(1000, 2), columns=['feature', 'target'])

# X-y split
X = df['feature']
y = df['target']

# Fit the data
X = sm.add_constant(X)      # Adds a column of ones to the array because no constant is added by 
                            # default in the model
model = sm.OLS(y, X).fit()

# Print the summary
print(model.summary())

What we see in the OLS summary and what it means:
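- **Df Residuals:** sample size (no. of observations) - (number of variables + 1)
> Degrees of freedom (Df) is the number of independent observations on the basis of which the sum of squares is calculated.
- **R-squared:** 
> The coefficient of determination, which tells us what percentage of the variation in the dependent variable can be explained by the independent variable(s), e.g. an R-squared of 0.669 means that 66.9% of the variation in y can be explained by X. 
In this case, approximately 0% of the variation in y can be explained by X, because the feature and target columns were generated independently at random.
- **F-statistic:**
> The F-test assesses the overall significance of the regression: it tests the null hypothesis that all slope coefficients are zero, i.e. that the model explains no more variation than an intercept-only model. As with a t-test, a small p-value (shown as Prob (F-statistic) in the summary) is evidence against the null hypothesis.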
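To see where these numbers come from, the following sketch recomputes Df Residuals, R-squared, and the F-statistic by hand with NumPy for data generated the same way as in the cell above (independent random feature and target). The variable names are illustrative, and the values will not match the summary exactly because the random draws differ.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Feature and target drawn independently, as in the cell above
n = 1000
feature = 100 * rng.random(n)
target = 100 * rng.random(n)

# Design matrix with a constant column (intercept) and one feature
X = np.column_stack([np.ones(n), feature])
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
residuals = target - X @ coef

k = X.shape[1] - 1          # number of variables, excluding the constant
df_model = k
df_resid = n - (k + 1)      # sample size - (number of variables + 1)

ss_res = np.sum(residuals ** 2)                  # unexplained variation
ss_tot = np.sum((target - target.mean()) ** 2)   # total variation
r_squared = 1 - ss_res / ss_tot

# F = (explained variation per model df) / (unexplained variation per residual df)
f_stat = ((ss_tot - ss_res) / df_model) / (ss_res / df_resid)

print(f"Df Residuals = {df_resid}")
print(f"R-squared    = {r_squared:.4f}")
print(f"F-statistic  = {f_stat:.4f}")
```

Because the feature carries no information about the target, R-squared should come out close to 0.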