1\. **Spotting correlations**

Load the remote file:

```bash
https://www.dropbox.com/s/aamg1apjhclecka/regression_generated.csv
```

with Pandas and create scatter plots with all possible combinations of the following features:
    
  + features_1
  + features_2
  + features_3
  
Are these features correlated?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Assumes the CSV has already been downloaded next to the notebook
file_name = "regression_generated.csv"
data = pd.read_csv(file_name)

# Column names as listed in the exercise text
x = data['features_1']
y = data['features_2']
z = data['features_3']

# One panel per pair of features
fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, ncols=1, figsize=(12, 36))
ax1.scatter(x=x, y=y, marker='o', c='c', edgecolor='b')
ax2.scatter(x=x, y=z, marker='o', c='r', edgecolor='k')
ax3.scatter(x=y, y=z, marker='o', c='g', edgecolor='k')
ax1.set_xlabel("features_1")
ax1.set_ylabel("features_2")
ax2.set_xlabel("features_1")
ax2.set_ylabel("features_3")
ax3.set_xlabel("features_2")
ax3.set_ylabel("features_3")
plt.show()
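
The question "Are these features correlated?" can also be checked numerically. The following is a minimal sketch, not part of the original solution: it enumerates all feature pairs with `itertools.combinations` and computes the Pearson coefficient for each. The DataFrame is a synthetic stand-in with made-up values (only the column names come from the exercise text); with the real file it would be replaced by the `pd.read_csv` call.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic stand-in for the CSV (made-up values, NOT the real dataset);
# in the exercise this would be: data = pd.read_csv("regression_generated.csv")
rng = np.random.default_rng(0)
f1 = rng.normal(size=500)
data = pd.DataFrame({
    "features_1": f1,
    "features_2": 2.0 * f1 + rng.normal(scale=0.5, size=500),  # built to follow features_1
    "features_3": rng.normal(size=500),                        # independent of the others
})

features = ["features_1", "features_2", "features_3"]
pairs = list(combinations(features, 2))  # 3 features -> 3 unordered pairs

# Pearson correlation coefficient for each pair
correlations = {(a, b): data[a].corr(data[b]) for a, b in pairs}
for (a, b), r in correlations.items():
    print(f"{a} vs {b}: r = {r:+.3f}")
```

A coefficient close to ±1 indicates a strong linear correlation, while a value near 0 only rules out a linear one, so the scatter plots are still worth inspecting.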

2\. **Color-coded scatter plot**

Produce a scatter plot from a dataset with two categories.

* Write a function that generates a 2D dataset consisting of 2 categories. Each category should be distributed as a 2D Gaussian with a given mean and standard deviation. Use different values of the mean and standard deviation for the two categories.
* Display the dataset in a scatter plot marking the two categories with different marker colors.

An example is given below:

In [None]:
from IPython.display import Image
Image('images/two_categories_scatter_plot.png')

In [None]:
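import numpy as np
import matplotlib.pyplot as plt

def gaussian(mean, sta_dev):
    # 300 samples from a 1D Gaussian; x and y are drawn independently
    return np.random.normal(mean, sta_dev, 300)

# Category 1: centred at (1, 1), standard deviation 0.4
x1 = gaussian(1, 0.4)
y1 = gaussian(1, 0.4)
plt.scatter(x1, y1, label='category 1')

# Category 2: centred at (0, 0), standard deviation 0.6
x2 = gaussian(0, 0.6)
y2 = gaussian(0, 0.6)
plt.scatter(x2, y2, label='category 2')

plt.legend()
plt.show()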
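
The exercise asks for a single function that returns the whole two-category dataset. One possible shape for it is sketched below; the function name `two_category_dataset`, its parameters and the `seed` argument are illustrative choices, not taken from the original solution.

```python
import numpy as np

def two_category_dataset(mean_a, std_a, mean_b, std_b, n=300, seed=None):
    """Return (points, labels): points has shape (2n, 2), labels holds 0 or 1."""
    rng = np.random.default_rng(seed)
    # loc/scale broadcast over the last axis, so each may be a scalar or an (x, y) pair
    a = rng.normal(loc=mean_a, scale=std_a, size=(n, 2))
    b = rng.normal(loc=mean_b, scale=std_b, size=(n, 2))
    points = np.vstack([a, b])
    labels = np.repeat([0, 1], n)  # first n rows are category 0, last n are category 1
    return points, labels

points, labels = two_category_dataset((1.0, 1.0), 0.4, (0.0, 0.0), 0.6, seed=1)
```

With this layout, `plt.scatter(points[:, 0], points[:, 1], c=labels)` colours each point by its category in a single call.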

3\. **Profile plot**

Produce a profile plot from a scatter plot.
* Download the following pickle file:
```bash
wget https://www.dropbox.com/s/3uqleyc3wyz52tr/residuals_261.pkl -P data/
```
* Inspect the dataset; you'll find two variables (features)
* Convert the content to a Pandas DataFrame
* Clean the sample by selecting the entries (rows) with the absolute value of the variable "residuals" smaller than 2
* Plot a Seaborn `jointplot` of "residuals" versus "distances", and use seaborn to display a linear regression.

Comment on the correlation between these variables.

* Create manually (without using seaborn) the profile histogram for the "distances" variable; choose an appropriate binning.
* Obtain 3 numpy arrays:
  * `x`, the array of bin centers of the profile histogram of the "distances" variable
  * `y`, the mean values of the "residuals", estimated in slices (bins) of "distances"
  * `err_y`, the standard deviation of the "residuals", estimated in slices (bins) of "distances"
* Plot the profile plot on top of the scatter plot

In [None]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# The pickle stores a dict wrapped in a 0-d object array, hence .item().
# Adjust the path if the file was downloaded into data/ as in the wget command.
data = np.load("residuals_261.pkl", allow_pickle=True).item()
data = pd.DataFrame(data)

# Clean the sample: keep only the rows with |residuals| < 2
data = data[abs(data['residuals']) < 2]
print(data)
print(data.info())

# Take x and y after the selection, so the fit uses the cleaned sample only
x = data['residuals']
y = data['distances']

slope, intercept, r_value, p_value, stderr = stats.linregress(x, y)
print("slope =", slope, " intercept =", intercept, " r_value =", r_value, " p_value =", p_value, " stderr =", stderr)
sns.jointplot(data=data, x="residuals", y="distances", kind="reg")

# Coarse binning of "distances" as a cross-check with pandas
dis_array = np.array(data.distances)
bins = [0, 5, 10, 15, 20]
dis_bin = np.histogram(dis_array, bins=bins)
print(dis_bin)
binned_data = pd.cut(dis_array, bins)
print(binned_data)

mean_std = data.groupby(pd.cut(data['distances'], bins=bins))['residuals'].agg(['mean', 'std'])
print(mean_std)

# Manual profile: mean and standard deviation of "residuals" in slices of "distances"
fig = plt.figure(figsize=(10, 6))
ax = fig.add_axes([0.1, 0.1, 0.65, 0.65])
ax_histx = fig.add_axes([0.1, 0.8, 0.65, 0.2], sharex=ax)
ax_histx.set_title('Histogram')
ax.scatter(data['distances'], data['residuals'])
h, edges, _ = ax_histx.hist(data['distances'], bins=25)
x = 0.5 * (edges[1:] + edges[:-1])   # bin centers, one per bin
y = np.zeros(len(x))
err_y = np.zeros(len(x))
last = len(x) - 1
for i in range(len(x)):
    # Slices are delimited by the bin edges; the last bin also includes its right edge
    below = data['distances'] <= edges[i + 1] if i == last else data['distances'] < edges[i + 1]
    mask_i = (data['distances'] >= edges[i]) & below
    y[i] = np.mean(data[mask_i].residuals)
    err_y[i] = np.std(data[mask_i].residuals)
# Profile plot on top of the scatter plot
ax.errorbar(x, y, yerr=err_y, fmt='o', color='r')
ax.set_xlabel('distances')
ax.set_ylabel('residuals')
plt.show()
print("x: ", x)
print("y: ", y)
print("err_y:", err_y)
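
The per-bin loop can also be wrapped in a small reusable helper. The sketch below is one possible variant based on `np.digitize`; the function name `profile` and the synthetic `distances`/`residuals` arrays are made up for illustration and are not the contents of `residuals_261.pkl`.

```python
import numpy as np

def profile(x_values, y_values, n_bins=25):
    """Bin centers, per-bin mean and per-bin standard deviation of y in slices of x."""
    edges = np.linspace(x_values.min(), x_values.max(), n_bins + 1)
    centers = 0.5 * (edges[1:] + edges[:-1])
    # digitize returns 1..n_bins for values inside the range; the maximum lands in
    # n_bins + 1, so after shifting to 0-based indices it is clipped into the last bin
    idx = np.clip(np.digitize(x_values, edges) - 1, 0, n_bins - 1)
    mean = np.full(n_bins, np.nan)
    std = np.full(n_bins, np.nan)
    for i in range(n_bins):
        in_bin = y_values[idx == i]
        if in_bin.size > 0:  # empty bins stay NaN
            mean[i] = in_bin.mean()
            std[i] = in_bin.std()
    return centers, mean, std

# Synthetic stand-in data (made-up values)
rng = np.random.default_rng(0)
distances = rng.uniform(0, 20, size=5000)
residuals = 0.05 * distances + rng.normal(scale=0.5, size=5000)

x, y, err_y = profile(distances, residuals, n_bins=10)
```

The three arrays can then be drawn over the scatter plot with `ax.errorbar(x, y, yerr=err_y, fmt='o')`.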

4\. **Kernel Density Estimate**

Produce a KDE for a given distribution (by hand, not using seaborn):

* Fill a numpy array `x` of length N (with $N=\mathcal{O}(100)$) with a normally distributed variable, with a given mean and standard deviation
* Fill a histogram in pyplot, taking proper care of the aesthetics:
   * use a meaningful number of bins
   * set a proper y axis label
   * set proper values for the y axis major tick labels (e.g. you want to display only integer labels)
   * display the histogram as data points with errors (the error being the Poisson uncertainty)
* For every element of `x`, create a Gaussian with the mean corresponding to the element value and the standard deviation as a parameter that can be tuned. The standard deviation default value should be:
$$ 1.06 * x.std() * x.size ^{-\frac{1}{5}} $$
You can use the scipy function `stats.norm()` for that.
* In a separate plot (to be placed beside the original histogram), plot all the Gaussian functions so obtained
* Sum (with `np.sum()`) all the Gaussian functions and normalize the result such that the integral matches the integral of the original histogram. For that you could use the `scipy.integrate.trapz()` function (named `scipy.integrate.trapezoid()` in recent SciPy releases). Superimpose the normalized sum of all Gaussians on the first histogram.
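
The numerical part of the steps above could be sketched as follows. This is an illustrative outline under stated assumptions, not a reference solution: the values of `N`, `mean` and `std`, the number of bins, the grid and the helper name `kde` are all arbitrary choices, and the plotting is left out.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

rng = np.random.default_rng(0)
N, mean, std = 200, 5.0, 2.0  # illustrative values, not prescribed by the exercise
x = rng.normal(mean, std, size=N)

# Histogram as data points: bin centers, counts and Poisson errors sqrt(counts)
counts, edges = np.histogram(x, bins=15)
centers = 0.5 * (edges[1:] + edges[:-1])
errors = np.sqrt(counts)
hist_area = np.sum(counts * np.diff(edges))  # integral of the histogram

def kde(x, grid, bandwidth=None):
    """Return the individual Gaussians (one row per sample) and their sum on `grid`."""
    if bandwidth is None:
        bandwidth = 1.06 * x.std() * x.size ** (-1 / 5)  # default from the exercise text
    gaussians = stats.norm.pdf(grid[np.newaxis, :], loc=x[:, np.newaxis], scale=bandwidth)
    return gaussians, np.sum(gaussians, axis=0)

grid = np.linspace(x.min() - 3 * std, x.max() + 3 * std, 1000)
gaussians, total = kde(x, grid)

# Rescale the sum so that it encloses the same area as the histogram
kde_curve = total * hist_area / trapezoid(total, grid)
```

For the plots, `plt.errorbar(centers, counts, yerr=errors, fmt='o')` draws the histogram as points with error bars, each row of `gaussians` can be plotted against `grid` in the second panel, and `kde_curve` goes on top of the first.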
