1\. **Spotting correlations**

Load the remote file:

```bash
https://www.dropbox.com/s/aamg1apjhclecka/regression_generated.csv
```

with Pandas and create scatter plots with all possible combinations of the following features:
    
  + features_1
  + features_2
  + features_3
  
Are these features correlated?
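
Beyond reading the scatter plots by eye, the pairwise Pearson correlation coefficients give a number to compare against. The sketch below is only illustrative: it builds a synthetic stand-in `DataFrame` (the column names mirror the exercise, but the values are made up and are not the Dropbox file) and calls `DataFrame.corr()` on it.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the exercise file): features_2 is built to
# depend on features_1, while features_3 is independent noise.
rng = np.random.default_rng(0)
f1 = rng.normal(size=500)
demo = pd.DataFrame({
    "features_1": f1,
    "features_2": 2.0 * f1 + rng.normal(scale=0.5, size=500),
    "features_3": rng.normal(size=500),
})

# Pearson correlation coefficient for every pair of columns
corr = demo.corr()
print(corr.round(2))
```

Coefficients close to +1 or -1 point to a strong linear relation, while values near 0 suggest none. A nonlinear dependence can still hide behind a small coefficient, so the scatter plots remain worth a look.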

In [None]:
# libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import scipy.integrate  # trapezoidal integration, used in exercise 4

In [None]:
url='https://www.dropbox.com/s/aamg1apjhclecka/regression_generated.csv?dl=1'
# file = urllib.request.urlopen(url)
pl = pd.read_csv(url) 
# pl

In [None]:
# plotting every pair of features against each other, colored by label

%matplotlib inline

pl.plot.scatter(x='features_1',y='features_2',c='label',colormap='viridis')
pl.plot.scatter(x='features_2',y='features_3',c='label',colormap='gist_rainbow')
pl.plot.scatter(x='features_1',y='features_3',c='label',colormap='viridis')


2\. **Color-coded scatter plot**

Produce a scatter plot from a dataset with two categories.

* Write a function that generates a 2D dataset consisting of 2 categories. Each category should be distributed as a 2D Gaussian with a given mean and standard deviation. Use different values of the mean and standard deviation for the two samples.
* Display the dataset in a scatter plot marking the two categories with different marker colors.

An example is given below:

In [None]:
from IPython.display import Image
Image('images/two_categories_scatter_plot.png')

In [None]:
# Each category is an independent Gaussian in x and in y, with a randomly
# drawn mean in [0, 20) and standard deviation in [0, 1) along each axis.

def generate_mul(n, size=1000):
    """Scatter-plot n categories of 2D Gaussian points, one color per category."""
    fig, ax = plt.subplots(figsize=(8, 5))

    for i in range(n):
        x = np.random.normal(np.random.uniform(0., 20.), np.random.uniform(0., 1.), size)
        y = np.random.normal(np.random.uniform(0., 20.), np.random.uniform(0., 1.), size)
        # each call to scatter picks the next color of the default cycle
        ax.scatter(x, y, s=5, label=f"category {i}")

    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend()
    fig.tight_layout()

generate_mul(2)
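
The exercise asks for a given mean and standard deviation per category, whereas the cell above draws them at random. A minimal sketch of the fixed-parameter variant follows. The function name `make_two_categories` and the particular means and standard deviations are illustrative assumptions, and each category is taken to be an isotropic Gaussian (same standard deviation along x and y).

```python
import numpy as np

def make_two_categories(means, stds, n=500, seed=0):
    """Return (points, labels) for two 2D Gaussian categories.

    means: two (x, y) centers; stds: two standard deviations.
    Hypothetical helper, not part of the original notebook.
    """
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for label, (mean, std) in enumerate(zip(means, stds)):
        # n points around `mean`, with the same spread along both axes
        points.append(rng.normal(loc=mean, scale=std, size=(n, 2)))
        labels.append(np.full(n, label))
    return np.concatenate(points), np.concatenate(labels)

points, labels = make_two_categories(means=[(0.0, 0.0), (4.0, 3.0)], stds=[1.0, 0.5])
```

Calling `plt.scatter(points[:, 0], points[:, 1], c=labels)` then colors the two categories differently, because the integer labels are mapped through the colormap.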

3\. **Profile plot**

Produce a profile plot from a scatter plot.
* Download the following pickle file:
```bash
wget https://www.dropbox.com/s/3uqleyc3wyz52tr/residuals_261.pkl -P data/
```
* Inspect the dataset; you'll find two variables (features)
* Convert the content to a Pandas DataFrame
* Clean the sample by selecting the entries (rows) where the absolute value of the variable "residuals" is smaller than 2
* Plot a Seaborn `jointplot` of "residuals" versus "distances", and use Seaborn to display a linear regression.

Comment on the correlation between these variables.

* Create manually (without using Seaborn) the profile histogram for the "distances" variable; choose an appropriate binning.
* Obtain 3 numpy arrays:
  * `x`, the array of bin centers of the profile histogram of the "distances" variable
  * `y`, the mean values of the "residuals", estimated in slices (bins) of "distances"
  * `err_y`, the standard deviation of the "residuals", estimated in slices (bins) of "distances"
* Plot the profile plot on top of the scatter plot
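
The binning steps above can be sketched with plain NumPy. The data here are synthetic (a linear trend plus noise, invented for illustration), and the helper name `profile` is an assumption. `np.digitize` assigns each point to a bin, after which the mean and standard deviation are computed bin by bin.

```python
import numpy as np

def profile(xvals, yvals, bins=10):
    """Bin centers, per-bin mean and per-bin standard deviation of yvals.

    Hypothetical helper; empty bins are returned as NaN.
    """
    edges = np.histogram_bin_edges(xvals, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Bin index of each point. Passing only the inner edges puts the
    # maximum value into the last bin instead of past it.
    idx = np.digitize(xvals, edges[1:-1])
    mean = np.full(len(centers), np.nan)
    std = np.full(len(centers), np.nan)
    for i in range(len(centers)):
        sel = yvals[idx == i]
        if sel.size:
            mean[i] = sel.mean()
            std[i] = sel.std()
    return centers, mean, std

# Synthetic stand-in for the "distances" and "residuals" columns
rng = np.random.default_rng(1)
dist = rng.uniform(0.0, 20.0, size=5000)
resid = 0.05 * dist + rng.normal(scale=0.3, size=5000)

x, y, err_y = profile(dist, resid, bins=10)
```

Drawing `plt.errorbar(x, y, yerr=err_y, fmt="o")` after the scatter plot overlays the profile on it.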

In [None]:
# .item() unwraps the dictionary stored in the pickle
# (assumed to be a 0-d NumPy array holding a dict)
raw = pd.read_pickle("data/residuals_261.pkl").item()
data = pd.DataFrame(dict(raw))

# keep only the rows with |residuals| < 2, as the exercise asks
data = data.loc[data["residuals"].abs() < 2]
sns.jointplot(x="distances", y="residuals", data=data, kind="reg", scatter_kws={"s": 5}, line_kws={"color": "blue"})
plt.show()




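figure, axis = plt.subplots(2, 1, figsize=(6, 6))
# top panel: histogram of the distances, which also defines the binning
counts, bins, _ = axis[0].hist(data.distances, bins=30)
x = 0.5 * (bins[:-1] + bins[1:])  # bin centers
y = np.zeros(len(x))
err_y = np.zeros(len(x))
for i in range(len(x)):
    # residuals of the rows whose distance falls in bin i (lower edge included)
    c = data.loc[(data["distances"] >= bins[i]) & (data["distances"] < bins[i+1])].residuals
    y[i] = np.mean(c)
    err_y[i] = np.std(c)
# print(x)
# print(y)
print("error", err_y)
# bottom panel: scatter plot with the profile drawn on top
axis[1].scatter(data.distances, data.residuals, s=2)
axis[1].errorbar(x, y, yerr=err_y, fmt="o", color="red", markersize=3)
axis[1].set_xlabel("distances")
axis[1].set_ylabel("residuals")
plt.show()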

4\. **Kernel Density Estimate**

Produce a KDE for a given distribution (by hand, not using seaborn):

* Fill a numpy array `x` of length N (with $N=\mathcal{O}(100)$) with normally distributed values, with a given mean and standard deviation
* Fill a histogram in pyplot, taking proper care of the aesthetics:
   * use a meaningful number of bins
   * set a proper y-axis label
   * set proper values for the y-axis major tick labels (e.g. display only integer labels)
   * display the histogram as data points with errors (the error being the Poisson uncertainty)
* For every element of `x`, create a Gaussian with the mean corresponding to the element value and the standard deviation as a parameter that can be tuned. The standard deviation default value should be:
$$ 1.06 \cdot \texttt{x.std()} \cdot \texttt{x.size}^{-\frac{1}{5}} $$
You can use the scipy function `stats.norm()` for that.
* In a separate plot (to be placed beside the original histogram), plot all the Gaussian functions so obtained
* Sum (with `np.sum()`) all the Gaussian functions and normalize the result such that the integral matches the integral of the original histogram. For that you could use the `scipy.integrate.trapz()` function (named `scipy.integrate.trapezoid()` in recent SciPy releases). Superimpose the normalized sum of all Gaussians on the first histogram.
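
The construction above can be condensed into a short sketch. It uses plain NumPy for the Gaussian density and for the trapezoidal rule, instead of `stats.norm()` and the SciPy integrator, purely to keep the example free of extra dependencies. The function names `kde_by_hand` and `trapezoid_area`, the sample size, and the grid are illustrative assumptions.

```python
import numpy as np

def kde_by_hand(samples, grid, bandwidth=None):
    """Sum of one Gaussian per sample, evaluated on grid (not normalized)."""
    if bandwidth is None:
        # default from the exercise: 1.06 * std * N^(-1/5)
        bandwidth = 1.06 * samples.std() * samples.size ** (-1 / 5)
    # shape (n_samples, n_grid): row i is the Gaussian centered on samples[i]
    z = (grid[None, :] - samples[:, None]) / bandwidth
    gaussians = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return gaussians.sum(axis=0)

def trapezoid_area(yvals, xvals):
    """Trapezoidal-rule integral of yvals over xvals."""
    return np.sum(0.5 * (yvals[1:] + yvals[:-1]) * np.diff(xvals))

rng = np.random.default_rng(2)
samples = rng.normal(loc=0.0, scale=1.0, size=200)
grid = np.linspace(-6.0, 6.0, 601)

counts, edges = np.histogram(samples, bins=20)
hist_area = np.sum(counts * np.diff(edges))  # integral of the histogram

kde = kde_by_hand(samples, grid)
# rescale so the KDE encloses the same area as the histogram
kde_scaled = kde * hist_area / trapezoid_area(kde, grid)
```

Each Gaussian integrates to 1, so the unnormalized sum integrates to roughly N and the scale factor comes out close to the bin width. Both integrals must be taken with the actual x spacing; integrating over array indices instead would give a wrong factor.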


In [None]:
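import scipy.integrate  # trapezoid is the current name of trapz
from matplotlib.ticker import MaxNLocator

fig, axs = plt.subplots(1, 2, figsize=(15, 7))
N = 400

x = np.random.normal(0, 1, N)
val, bin_edge, _ = axs[0].hist(x, bins=50, alpha=0.3)

bin_center = 0.5 * (bin_edge[:-1] + bin_edge[1:])
bin_width = bin_edge[1] - bin_edge[0]
err_y = np.sqrt(val)  # Poisson uncertainty on a count n is sqrt(n)

axs[0].errorbar(bin_center, val, yerr=err_y, fmt="+k", ecolor='red', elinewidth=2, markersize=5)
axs[0].set_xlabel("x")
axs[0].set_ylabel("entries per bin")
axs[0].yaxis.set_major_locator(MaxNLocator(integer=True))  # integer tick labels only

#############

# one Gaussian per data point, all with the same tunable width
x_axis = np.arange(-5, 5, 0.1)
sigma = 1.06 * np.std(x) * x.size**(-1/5)
tot = np.zeros_like(x_axis)
for mu in x:
    norm = stats.norm.pdf(x_axis, mu, sigma)
    axs[1].plot(x_axis, norm)
    tot += norm

# integrate each curve on its own x grid before comparing them
integral = scipy.integrate.trapezoid(tot, x_axis)
print("Integral of sum:", integral)
intX = np.sum(val * bin_width)  # area of the histogram
print("Hist Integral:", intX)
fact = intX / integral
print("Factor", fact)

T = tot * fact
intT = scipy.integrate.trapezoid(T, x_axis)
print("New gaussian:", intT, "equals to histo")
axs[0].plot(x_axis, T, linewidth=2.5)
plt.show()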