1\. **Spotting correlations**

Load the remote file:

```bash
https://www.dropbox.com/s/aamg1apjhclecka/regression_generated.csv
```

with Pandas and create scatter plots with all possible combinations of the following features:
    
  + `features_1`
  + `features_2`
  + `features_3`
  
Are these features correlated? Please add a comment.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the file (assumes the CSV has already been downloaded to the working directory)
data = pd.read_csv('regression_generated.csv')

# Features to plot
features = ['features_1', 'features_2', 'features_3']

# Create scatter plots
for i in range(len(features)):
    for j in range(i+1, len(features)):
        plt.scatter(data[features[i]], data[features[j]])
        plt.xlabel(features[i])
        plt.ylabel(features[j])
        plt.show()

# Check correlation between features
correlation = data[features].corr()
print(correlation)

# The correlation matrix indicates low correlation between the features
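
As a complement to the cell above, the feature pairs can also be enumerated with `itertools.combinations` instead of nested index loops, and a Pearson coefficient can be attached to each pair. The sketch below is only illustrative: the `toy` dictionary and its values are made up (they are not the content of `regression_generated.csv`) so that the snippet runs without the download.

```python
from itertools import combinations

import numpy as np

# Hypothetical stand-in for the real columns (NOT the CSV content)
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = {
    "features_1": base,
    "features_2": 2.0 * base + rng.normal(scale=0.1, size=500),  # tied to features_1
    "features_3": rng.normal(size=500),                          # independent noise
}

# All unordered pairs of feature names
pairs = list(combinations(toy, 2))

# Pearson correlation coefficient for each pair
pearson = {(a, b): np.corrcoef(toy[a], toy[b])[0, 1] for a, b in pairs}

for (a, b), r in pearson.items():
    print(f"{a} vs {b}: r = {r:+.3f}")
```

With the real data, `toy` would be replaced by the columns of `data[features]`. A value of |r| close to 1 points to a strong linear correlation and a value near 0 to none, although a scatter plot can still reveal a nonlinear relationship that the coefficient misses.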

2\. **Color-coded scatter plot**

Produce a scatter plot from a dataset with two categories.

* Write a function that generates a 2D dataset consisting of 2 categories. Each category should distribute as a 2D gaussian with a given mean and standard deviation. Set different values of the mean and standard deviation between the two samples.
* Display the dataset in a scatter plot marking the two categories with different marker colors.

An example is given below:

In [None]:
from IPython.display import Image
Image('images/two_categories_scatter_plot.png')

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def generate_dataset(mean1, mean2, std1, std2, size):
    # Generate data for category 1 (isotropic 2D gaussian: same std on both axes)
    category1 = np.random.normal(loc=mean1, scale=std1, size=(size, 2))
    
    # Generate data for category 2
    category2 = np.random.normal(loc=mean2, scale=std2, size=(size, 2))
    
    # Combine the categories
    dataset = np.concatenate((category1, category2))
    
    return dataset

# Set the parameters for the dataset
mean1 = [2, 3]  # Mean of category 1
mean2 = [-1, -2]  # Mean of category 2
std1 = 0.5  # Standard deviation of category 1
std2 = 1.0  # Standard deviation of category 2
size = 100  # Size of each category

# Generate the dataset
dataset = generate_dataset(mean1, mean2, std1, std2, size)

# Separate the categories
category1 = dataset[:size]
category2 = dataset[size:]

# Plot the dataset
plt.scatter(category1[:, 0], category1[:, 1], color='blue', label='Category 1')
plt.scatter(category2[:, 0], category2[:, 1], color='red', label='Category 2')

# Add labels and legend
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()

# Show the plot
plt.show()
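
A possible variant, sketched here under assumptions, returns an explicit label array together with the points, so the categories do not have to be recovered by slicing. The function name `generate_labeled_dataset` and its `seed` parameter are hypothetical and not part of the exercise.

```python
import numpy as np

def generate_labeled_dataset(means, stds, size, seed=None):
    """Draw `size` points per category from isotropic 2D Gaussians.

    Returns the (len(means) * size, 2) array of points and the integer
    category label of each row.
    """
    rng = np.random.default_rng(seed)
    points = np.concatenate(
        [rng.normal(loc=mean, scale=std, size=(size, 2)) for mean, std in zip(means, stds)]
    )
    labels = np.repeat(np.arange(len(means)), size)
    return points, labels

points, labels = generate_labeled_dataset(
    means=[[2, 3], [-1, -2]], stds=[0.5, 1.0], size=100, seed=42
)
print(points.shape, labels.shape)
print("Category 0 sample mean:", points[labels == 0].mean(axis=0))
```

With labels available, a single call such as `plt.scatter(points[:, 0], points[:, 1], c=labels, cmap='coolwarm')` colors both categories at once.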

3\. **Profile plot**

Produce a profile plot from a scatter plot.
* Download the following pickle file:
```bash
wget https://www.dropbox.com/s/3uqleyc3wyz52tr/residuals_261.pkl -P data/
```
* Inspect the dataset; you'll find two variables (features)
* Convert the content to a Pandas DataFrame
* Clean the sample by selecting the entries (rows) with the absolute value of the variable "residuals" smaller than 2
* Plot a Seaborn `jointplot` of "residuals" versus "distances", and use Seaborn to display a linear regression.

Comment on the correlation between these variables.

* Create manually (without using Seaborn) the profile histogram for the "distances" variable; choose an appropriate binning.
* Obtain 3 numpy arrays:
  * `x`, the array of bin centers of the profile histogram of the "distances" variable
  * `y`, the mean values of the "residuals", estimated in slices (bins) of "distances"
  * `err_y`, the standard deviation of the "residuals", estimated in slices (bins) of "distances"
* Plot the profile plot on top of the scatter plot

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import pickle
import matplotlib.pyplot as plt

# Step 1: Inspect the dataset and convert to a Pandas DataFrame
with open('data/residuals_261.pkl', "rb") as file:
    data = pickle.load(file).item()

dataFrame = pd.DataFrame({"residuals": data["residuals"], "distances": data["distances"]})
print(dataFrame)

# Step 2: Clean the sample by selecting entries with absolute values of "residuals" < 2
clean_data = dataFrame[abs(dataFrame['residuals']) < 2]

# Step 3: Plot a Seaborn jointplot of "residuals" versus "distances" with linear regression
sns.jointplot(data=clean_data, x='distances', y='residuals', kind='reg')

# Step 4: Comment on the correlation between the variables
correlation = clean_data['distances'].corr(clean_data['residuals'])
print("The correlation between distances and residuals is:", correlation)

# Step 5: Create the profile histogram manually
bin_width = 10  # Choose an appropriate bin width
bins = np.arange(min(clean_data['distances']), max(clean_data['distances']) + bin_width, bin_width)
x = bins[:-1] + bin_width / 2
y = []
err_y = []

for i in range(len(bins) - 1):
    bin_entries = clean_data[(clean_data['distances'] >= bins[i]) & (clean_data['distances'] < bins[i+1])]
    y.append(bin_entries['residuals'].mean())
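    err_y.append(bin_entries['residuals'].std())

# Convert the lists to numpy arrays, as the exercise requires
y = np.array(y)
err_y = np.array(err_y)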

# Step 6: Plot the profile plot on top of the scatter plot
plt.scatter(clean_data['distances'], clean_data['residuals'], label='Scatter Plot')
plt.errorbar(x, y, yerr=err_y, color='red', label='Profile Plot')
plt.legend()

# Show the plot
plt.show()
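
The binning loop above can also be wrapped in a reusable function. The following is a minimal sketch, assuming equal-width bins and using `np.digitize` to assign entries to bins; the helper name `profile` and the toy data are hypothetical and stand in for the pickle content, which is not reproduced here.

```python
import numpy as np

def profile(x, y, n_bins=10):
    """Return bin centers, per-bin mean of y and per-bin standard deviation of y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Bin index of each entry; clipping puts x.max() into the last bin
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    mean = np.full(n_bins, np.nan)
    std = np.full(n_bins, np.nan)
    for b in range(n_bins):
        in_bin = y[idx == b]
        if in_bin.size > 1:  # bins with fewer than two entries stay NaN
            mean[b] = in_bin.mean()
            std[b] = in_bin.std(ddof=1)  # sample standard deviation, like pandas .std()
    return centers, mean, std

# Hypothetical toy data: y grows linearly with x, plus gaussian noise
rng = np.random.default_rng(1)
toy_x = rng.uniform(0, 20, size=5000)
toy_y = 0.1 * toy_x + rng.normal(scale=0.5, size=5000)

x, y, err_y = profile(toy_x, toy_y, n_bins=10)
print(np.column_stack((x, y, err_y)))
```

The three returned arrays can be passed directly to `plt.errorbar(x, y, yerr=err_y, fmt='o')` on top of the scatter plot. Choosing the number of bins rather than a fixed width keeps the binning sensible whatever the range of the variable turns out to be.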

4\. **Kernel Density Estimate**

Produce a KDE for a given distribution (by hand, not using seaborn):

* Fill a numpy array `x` of length N (with $N=\mathcal{O}(100)$) with a normally distributed variable, with a given mean and standard deviation
* Fill a histogram in pyplot, taking proper care of the aesthetics:
   * use a meaningful number of bins
   * set a proper y axis label
   * set proper value of y axis major ticks labels (e.g. you want to display only integer labels)
   * display the histograms as data points with errors (the error being the poisson uncertainty)
* For every element of `x`, create a gaussian with the mean corresponding to the element value and the standard deviation as a parameter that can be tuned. The standard deviation default value should be:
$$ 1.06 * x.std() * x.size ^{-\frac{1}{5}} $$
you can use the scipy function `stats.norm()` for that.
* In a separate plot (to be placed beside the original histogram), plot all the gaussian functions so obtained
* Sum (with `np.sum()`) all the gaussian functions and normalize the result such that the integral matches the integral of the original histogram. For that you could use the `scipy.integrate.trapezoid()` function (named `trapz` in older SciPy releases). Superimpose the normalized sum of all gaussians on the first histogram.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats, integrate

# Evaluate a single gaussian kernel (centered at `mean`, width `std`) at the points `x`
def create_kde(x, mean, std):
    kernel = stats.norm(mean, std)
    return kernel.pdf(x)

# Generate numpy array 'x' with normally distributed values
N = 100
mean = 0  # Set your desired mean
std = 1  # Set your desired standard deviation
x = np.random.normal(mean, std, N)

# Plotting the histogram and KDE
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# Plotting the histogram as data points with errors
counts, bins, _ = axes[0].hist(x, bins='auto', alpha=0.7, color='purple')
bin_centers = 0.5 * (bins[1:] + bins[:-1])
bin_width = bins[1] - bins[0]
errors = np.sqrt(counts)  # Poisson uncertainty
axes[0].errorbar(bin_centers, counts, yerr=errors, fmt='o', color='black')
axes[0].set_ylabel('Counts')
axes[0].yaxis.set_major_locator(plt.MaxNLocator(integer=True))
axes[0].set_xlabel('x')

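# Plotting the individual gaussians on a sorted, evenly spaced grid
# (the sample `x` itself is unsorted, so it cannot serve as the x axis of a line plot)
grid = np.linspace(x.min() - 3 * x.std(), x.max() + 3 * x.std(), 500)
std_tuned = 1.06 * x.std() * x.size**(-1/5)
kernels = np.array([create_kde(grid, point, std_tuned) for point in x])
for kernel in kernels:
    axes[1].plot(grid, kernel, alpha=0.3, color='green')
axes[1].set_xlabel('x')
axes[1].set_ylabel('Kernel density')

# Sum the gaussians and normalize the sum to the integral of the histogram
kde_sum = np.sum(kernels, axis=0)
hist_integral = np.sum(counts * bin_width)
kde_sum *= hist_integral / integrate.trapezoid(kde_sum, grid)
axes[0].plot(grid, kde_sum, color='red', linewidth=2)

plt.tight_layout()
plt.show()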
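
The normalization step is easy to get wrong, so here is a plotting-free sketch of the same idea that can be checked numerically. It assumes nothing beyond NumPy: the helpers `gaussian_pdf`, `trapezoid` and `kde_by_hand` are hypothetical names written for this illustration (the trapezoidal rule is spelled out by hand rather than taken from SciPy), and the default bandwidth is the $1.06\,\sigma\,N^{-1/5}$ rule quoted in the exercise.

```python
import numpy as np

def gaussian_pdf(grid, mean, std):
    """Normal probability density evaluated on `grid`."""
    return np.exp(-0.5 * ((grid - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def trapezoid(values, grid):
    """Trapezoidal-rule integral of `values` sampled on `grid`."""
    return np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid))

def kde_by_hand(sample, grid, bandwidth=None, target_integral=1.0):
    """Sum one gaussian per sample point and rescale to `target_integral`."""
    sample = np.asarray(sample, dtype=float)
    if bandwidth is None:
        # Default bandwidth from the exercise text
        bandwidth = 1.06 * sample.std() * sample.size ** (-1 / 5)
    # Shape (N, len(grid)): one row per gaussian, built by broadcasting
    kernels = gaussian_pdf(grid[np.newaxis, :], sample[:, np.newaxis], bandwidth)
    total = np.sum(kernels, axis=0)
    return total * target_integral / trapezoid(total, grid)

rng = np.random.default_rng(2)
sample = rng.normal(loc=0.0, scale=1.0, size=100)
grid = np.linspace(sample.min() - 3.0, sample.max() + 3.0, 1000)  # sorted, evenly spaced

counts, edges = np.histogram(sample, bins=10)
hist_integral = np.sum(counts * np.diff(edges))  # area under the histogram of counts
kde = kde_by_hand(sample, grid, target_integral=hist_integral)
print(trapezoid(kde, grid), hist_integral)
```

The two printed numbers should agree, which is exactly the condition the exercise asks for: the area under the summed gaussians equals the area under the histogram of counts, so the curve can be drawn on the same axes.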