# S3. Introduction to Descriptive Statistics.

<img src="Figures/stats1.png" alt="Drawing" style="width: 600px;"/>

### CONTENTS 
* 1. Sample vs Population
* 2. Data summary and data cleaning
* 3. Descriptive Statistics and visualization

## 1. Sample vs. Population

We will usually find ourselves in a situation where we wish to answer questions about a certain *population* but only have access to a *sample*.

<img src="Figures/Poblacion1.svg" alt="Drawing" style="width: 450px;"/>

* The **population** refers to all individuals who are relevant to a particular question or study, whereas a **sample** will be just a subset of these. 
* For example, all the customers of a distribution company form the population, whereas a study may use only a sample of them. What is a *population* for one question can be a *sample* for another: the customers of a distribution company are themselves only a sample of the population of a country.

<img src="Figures/Poblacion2.svg" alt="Drawing" style="width: 450px;"/>

Populations and samples are made up of several observations, individuals, elements, etc.

<img src="Figures/Poblacion3.svg" alt="Drawing" style="width: 450px;"/>

Whenever possible, it will be better to use the population to answer our questions, but sometimes this is not possible (you do not have all the data, not all customers have a Smart meter, collecting all the data from the same source is difficult, etc.). In these cases we will use a sample. 

What is the problem with this? Let's look at an example using our London dataset:

In [None]:
import pandas as pd # Pandas!
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


london = pd.read_csv('Data/block_13_diario.csv')
london.head(10)

In [None]:
london.describe()

Let's create two different samples

In [None]:
sample1 = london.iloc[1:50,4]
sample2 = london.iloc[1345:1854,4]

In [None]:
sample1.describe()

In [None]:
sample2.describe()

As can be seen, each sample yields different statistics. The difference between a sample statistic and the corresponding population parameter is the *sampling error* (or *estimation error*).




**Population mean (µ)**

In [None]:
mean_population = london['energy_max'].mean()
print('Mean population:',mean_population)

**Sample means (x̄)**

In [None]:
mean_sample1 = sample1.mean()
mean_sample2 = sample2.mean()

print('Mean sample 1:', mean_sample1)
print('Mean sample 2:', mean_sample2)

**Sampling error**

In [None]:
error1 = mean_population - mean_sample1
error2 = mean_population - mean_sample2

print('Sample 1 error:',error1)
print('Sample 2 error:',error2)

We can repeat this automatically for many random samples, each with the same number of observations.

The `sample()` method of a pandas Series or DataFrame returns a random sample of items from an axis of the object.
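
As a small illustration of how `sample()` behaves, the sketch below uses a made-up Series (a stand-in for a column such as `london['energy_max']`; the values are assumptions). It shows that fixing `random_state` makes the draw reproducible, which is why the loop in this notebook passes a different seed on each iteration.

```python
import pandas as pd

# Toy stand-in for a numeric column (values are made up)
values = pd.Series([0.4, 0.9, 1.3, 0.7, 1.1, 0.5, 0.8, 1.0])

# Same random_state -> the same rows are drawn every time
draw_a = values.sample(n=3, random_state=0)
draw_b = values.sample(n=3, random_state=0)

# A different seed generally selects different rows
draw_c = values.sample(n=3, random_state=1)

print(draw_a.equals(draw_b))
print(draw_a.index.tolist(), draw_c.index.tolist())
```

By default the rows are drawn without replacement, so no observation appears twice in the same sample.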

In [None]:
import matplotlib.pyplot as plt

num_obs = 20
num_samples = 100
list_mean=[]

for sample in range(num_samples):
    s = london['energy_max'].sample(num_obs, random_state=sample)
    list_mean.append(s.mean())
    
plt.scatter(range(1, num_samples + 1), list_mean, label='Sample means')
plt.axhline(london['energy_max'].mean(), color='green', label = 'Population mean')

# Set fixed y-axis limits (change values as per your requirement)
plt.ylim(ymin=0.2, ymax=1.6)
plt.xlabel('# Sample')
plt.ylabel('Mean')
plt.legend()
plt.show()

### How can we solve this? 

One way is, as mentioned, to get as close as possible to the entire population, that is, to use larger samples. Let's look at our example:

In [None]:
num_samples = 100
sizes = [50, 250, 500, 1000]  # number of observations in each sample
lists_mean = []

for num_obs in sizes:
    list_mean = []
    for sample in range(num_samples):
        s = london['energy_max'].sample(num_obs, random_state=sample)
        list_mean.append(s.mean())
    lists_mean.append(list_mean)

plt.figure(figsize=(15, 8))

# One panel per sample size; the green line marks the population mean
for i, (num_obs, list_mean) in enumerate(zip(sizes, lists_mean), start=1):
    plt.subplot(2, 2, i)
    plt.scatter(range(1, num_samples + 1), list_mean)
    plt.axhline(london['energy_max'].mean(), color='green')
    plt.ylim(0.6, 1.1)
    plt.ylabel('Mean')
    plt.title(f'{num_obs} observations per sample')

plt.show()
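
To put a number on the effect of sample size, one option is to measure how much the sample means spread around the population mean. The sketch below does this on a synthetic, right-skewed population (an assumption, since the London file is not loaded here). The names `population` and `spread` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "population" (an assumption, not the London data)
population = rng.gamma(shape=2.0, scale=0.4, size=10_000)

num_samples = 200
spread = {}
for num_obs in [50, 250, 500, 1000]:
    # Draw num_samples samples without replacement and keep their means
    means = [rng.choice(population, size=num_obs, replace=False).mean()
             for _ in range(num_samples)]
    # Standard deviation of the sample means = empirical standard error
    spread[num_obs] = float(np.std(means))

for num_obs, se in spread.items():
    print(num_obs, round(se, 4))
```

The spread should shrink as the number of observations grows, roughly in proportion to $1/\sqrt{n}$.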

We must also ensure that the sample is representative of the different possible categories in our dataset. For that you can use **stratified sampling**.

<img src="Figures/stratified_sampling.jpg" alt="Drawing" style="width: 450px;"/>
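
A minimal sketch of stratified sampling with pandas, assuming a made-up `customers` table with a hypothetical `tariff` column as the category to preserve. It relies on `groupby(...).sample(...)`, which draws the requested fraction from each group separately.

```python
import pandas as pd

# Toy customers: 80 residential, 20 commercial (made-up data)
customers = pd.DataFrame({
    'customer_id': range(100),
    'tariff': ['residential'] * 80 + ['commercial'] * 20,
})

# Draw 10 % of the rows from each tariff group separately
stratified = customers.groupby('tariff').sample(frac=0.1, random_state=0)

print(stratified['tariff'].value_counts())
```

Because each group is sampled on its own, the sample keeps the 80/20 split of the original table (8 residential and 2 commercial rows here), whereas a plain `sample()` could over- or under-represent the smaller group.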

## 2. Data summary and data cleaning

Pandas offers several ways to summarize the data, such as the *describe* method seen above. In addition, we can use other methods such as:

<img src="Figures/pandas_summary.png" alt="Drawing" style="width: 450px;"/>

Two other useful methods for a single column are *value_counts*, which returns the frequency of each distinct value, and *nunique*, which returns the number of distinct values.
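
The difference between the two methods is easy to see on a tiny example. The meter IDs below are made up for illustration.

```python
import pandas as pd

# Made-up meter IDs, one entry per daily reading
ids = pd.Series(['MAC01', 'MAC02', 'MAC01', 'MAC03', 'MAC01', 'MAC02'])

print(ids.value_counts())  # rows per distinct ID, most frequent first
print(ids.nunique())       # number of distinct IDs
```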

In [None]:
### Try some of these functions

# A notebook cell only displays its last expression, so print each result
print(london['energy_max'].sum())
print(london['energy_max'].max())

<div style="background-color:#ccffcc; padding:10px; border-radius:5px;">

### <span style="color:blue">Exercise 1</span>

Calculate the standard deviation for MAC000113.
    </div>

In [None]:
# write your code here







### Data cleaning: treat the missing data

There are several options for dealing with missing values, and pandas offers some quick and convenient ones:

<img src="Figures/missing.png" alt="Drawing" style="width: 450px;"/>

A useful tool for summarizing is the *pandas-profiling* library (now distributed as *ydata-profiling*), which profiles a *dataframe* and reports summarized and grouped results.


<div style="background-color:#ccffcc; padding:10px; border-radius:5px;">

### <span style="color:blue">Exercise 2</span>

How many missing values does the London dataset have?
    </div>

In [None]:
# write your code here








<div style="background-color:#ccffcc; padding:10px; border-radius:5px;">

### <span style="color:blue">Exercise 3</span>
Which method would you use: `.dropna()` or `.fillna()`?
    </div>

In [None]:
# write your code here









In [None]:
# how many missing data do we have now?









# 3. Descriptive Statistics and visualization

## 3.1 Frequency Distribution

A data set is made up of a distribution of values. This is valuable information to understand the dataset we are working with.


Several plot types help visualize the distribution of our values. We are going to focus on the following two:

* Histogram
* Density Plots

In [None]:
london.head()

In [None]:
### Let's make a histogram
import seaborn as sns
from matplotlib import colors as mcolors

colors = dict(mcolors.BASE_COLORS, **mcolors.CSS4_COLORS)
colors_names = [name for name, color in colors.items()]

# print(colors_names)

plt.figure(figsize=(12,8))
sns.histplot(data=london, x = "energy_mean", bins=20, element="bars")

In [None]:
SMs = london['LCLid'].unique()
SMs

In [None]:

plt.figure(figsize=(12,8))


for i, SM in enumerate(SMs[:4]):
    sns.histplot(data=london[london['LCLid'] == SM], x="energy_max", kde=True, label=SM, color=colors_names[i], bins=50, alpha=0.2)

# Draw the legend once, after all histograms have been added
plt.legend(loc='best', bbox_to_anchor=(1, 1))

### Let's make a density plot

In [None]:
sns.displot(london.loc[london['LCLid']=='MAC000113']['energy_max'], kde=True)
plt.show()


See the `seaborn.displot` documentation: https://seaborn.pydata.org/generated/seaborn.displot.html

In [None]:

sns.kdeplot(london.loc[london['LCLid']=='MAC000113']['energy_max'])
plt.show()

<div style="background-color:#ccffcc; padding:10px; border-radius:5px;">

### <span style="color:blue">Exercise 4</span>
Plot the density of `energy_max` for the last 5 end-users in the same figure.
    </div>

In [None]:
# write your code here












## 3.2 Measures of central tendency: Mean, median and mode

Once we have a summary of the data, some parameters that are very useful for understanding how the values in our *dataset* are distributed are:
* Mean
* Median
* Mode

#### Mean

We can think of the mean as the center of gravity of the data of a distribution. Let's look at an example and discuss what information can be obtained and how it can help us or, conversely, misinform us if we are not careful.

In [None]:
import random
import numpy as np

population = [0,2,3,3,3,4,13]
sample = random.choices(population, k=3)  # Randomly select 3 values (with replacement) from the population.

mean_pop = np.mean(population)
mean_samp = np.mean(sample)

print('Mean Population:', mean_pop)
print('Mean Sample:', mean_samp)
print(sample)

Pandas provides easy access to common calculations, such as:

In [None]:
london['energy_max'].mean()

#### Mean and Median

*netherlands* Dataset. 

Source: https://www.kaggle.com/datasets/lucabasa/dutch-energy/data


In [None]:
import pandas as pd
netherlands = pd.read_csv('data/Electricity_Netherlands/coteq_electricity_2019.csv')
netherlands = netherlands.dropna()  # dropna() returns a new DataFrame, so keep the result
netherlands.head(5)

If you want to obtain the average consumption of the whole dataset, you could calculate it as:

In [None]:
netherlands['annual_consume'].mean()

In [None]:
netherlands['annual_consume'].median()

#### But if we take a good look at the *dataset*, we can see that this average is not representative. Why?

In [None]:
# PROFESSOR
# let's check some minimum and maximum values
print(netherlands.describe())

# let's check the histogram, marking the mean and the median
plt.figure(figsize=(12,8))
sns.histplot(data = netherlands["annual_consume"], color=colors_names[0], bins = 50, alpha=0.5, label='annual_consume')
plt.axvline(x=netherlands["annual_consume"].mean(), color='red', linestyle='--', label='Mean')
plt.axvline(x=netherlands["annual_consume"].median(), color='yellow', linestyle='--', label='Median')
plt.legend()

We have seen that computing the average, even when it is possible, is not always appropriate: a few extreme values can pull it away from the typical observation. At other times we cannot compute the mean at all.

In these cases, the **median** may be a good alternative measure.

Another advantage of the median is that it depends only on the order of the values, not on their magnitude, which makes it more resistant to extreme values in the distribution.
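
A quick way to see this resistance is to take the small `population` list from the Mean section and replace its largest value with a much more extreme one (the value 1300 is an arbitrary choice for illustration):

```python
import numpy as np

population = [0, 2, 3, 3, 3, 4, 13]
with_outlier = population[:-1] + [1300]  # replace 13 by an extreme value

print(np.mean(population), np.median(population))
print(np.mean(with_outlier), np.median(with_outlier))
```

The mean jumps from 4 to well above 100, while the median stays at 3 in both cases.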

#### Mode

We have seen that sometimes the mean will not give us the information we are looking for, or simply cannot be calculated and we will use the median. On other occasions, however, the mode can also be useful to us. For example:

In [None]:
netherlands['city'].head(5)

In [None]:
netherlands['city'].value_counts()
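
The mode can be read off the top of `value_counts()`, or obtained directly with `Series.mode()`. A minimal sketch on a made-up city column (the labels below are illustrative and not taken from the real `netherlands['city']` column):

```python
import pandas as pd

# Made-up categorical column
city = pd.Series(['ALMELO', 'GOOR', 'ALMELO', 'OLDENZAAL', 'ALMELO', 'GOOR'])

counts = city.value_counts()
print(counts.idxmax())       # label with the highest count
print(city.mode().tolist())  # mode() returns every value tied for first place
```

Note that a mean or median of city names would be meaningless, which is why the mode is the natural summary for categorical data.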

## 3.3 Variability

Let's look at two distributions:

In [None]:
import numpy as np

A=[4,4,4,4]
B=[0,8,0,8]

print('The mean of A is:',np.mean(A))
print('The mean of B is:',np.mean(B))

In [None]:
plt.hist(A, label="A")
plt.hist(B, label= "B")
plt.legend()

Indeed, two very different distributions can have the same mean.

What other parameter can help us to distinguish the two distributions? For example, the range:

In [None]:
range_A=max(A)-min(A)
range_B=max(B)-min(B)

print('The range of A is:',range_A)
print('The range of B is:',range_B)

But the range only considers two values, so it is not always a good measure:

In [None]:
C=[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,21]

range_C=max(C)-min(C)
print('The mean of C is:',np.mean(C))
print('The range of C is:',range_C)

We see that we have a distribution with very little variability, but with a very high range. This is because it only considers two values of the distribution and not the whole distribution.

If we consider all values we can calculate:

<img src="Figures/variabilities.svg" alt="Drawing" style="width: 450px;"/>

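To take every value into account, we will use the variance, the mean of the squared deviations from the mean:

$$ \text{Variance} = \sigma^2 = \frac{1}{n} \sum_i (x_i - \mu)^2 $$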
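
As a worked example, the formula can be applied by hand to the distributions $A = [4,4,4,4]$ and $B = [0,8,0,8]$ introduced at the start of this section. Both have mean $\mu = 4$:

$$ \text{Variance}(A) = \frac{(4-4)^2 + (4-4)^2 + (4-4)^2 + (4-4)^2}{4} = 0 $$

$$ \text{Variance}(B) = \frac{(0-4)^2 + (8-4)^2 + (0-4)^2 + (8-4)^2}{4} = \frac{16 + 16 + 16 + 16}{4} = 16 $$

So the two distributions share the same mean, but the variance separates them clearly.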

<div style="background-color:#ccffcc; padding:10px; border-radius:5px;">

### <span style="color:blue">Exercise 5</span>
Apply the variance formula to the three distributions defined below. Write a function called `var_func`. What values are obtained?
    </div>

In [None]:
# We define the three distributions

dist1 = [1, 2, 8, 9]
dist2 = [3, 4, 6, 7]
dist3 = [5, 5, 5, 5]

In [None]:
# write your code here (variance function)










In [None]:
print('Variance Dist 1: ', var_func(dist1))
print('Variance Dist 2: ', var_func(dist2))
print('Variance Dist 3: ', var_func(dist3))

The problem with the variance is that it is expressed in squared units, so its value is hard to interpret directly.

A shorter way: NumPy provides the function `np.var()`, which by default divides by $n$, as in the formula above.

In [None]:
print('Variance Dist 1: ', np.var(dist1))
print('Variance Dist 2: ', np.var(dist2))
print('Variance Dist 3: ', np.var(dist3))

### Let's see another example

In [None]:
week_consum =[0, 7, 8, 5, 7]

print('The variance:', np.var(week_consum))

To get a measure of spread in the same units as the data, we use the **standard deviation**, the square root of the variance.

Variance:

$$ \sigma^2 = \frac{1}{n} \sum_i (x_i - \mu)^2 $$

Standard deviation:

$$ \sigma = \sqrt{\frac{1}{n} \sum_i (x_i - \mu)^2} $$


In [None]:
week_consum =[0, 7, 8, 5, 7]

print('The standard deviation is:', np.std(week_consum))
print('The mean is:', np.mean(week_consum))
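
One detail worth keeping in mind when mixing libraries: `np.std()` divides by $n$ by default (`ddof=0`), whereas the pandas `.std()` method divides by $n - 1$ by default (`ddof=1`). The sketch below compares both on the same `week_consum` list:

```python
import numpy as np
import pandas as pd

week_consum = [0, 7, 8, 5, 7]

pop_std = np.std(week_consum)              # divides by n (ddof=0)
sample_std = pd.Series(week_consum).std()  # divides by n - 1 (ddof=1)

print(pop_std, sample_std)
print(np.std(week_consum, ddof=1))         # matches the pandas result
```

The two values differ noticeably for short lists and converge as the number of observations grows.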

## Standard deviation in normal distributions and boxplots

<img src="Figures/boxplot_normal_dist.png" alt="Drawing" style="width: 600px;"/>
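
Assuming the figure refers to the usual 68-95-99.7 rule for normal distributions, the shares can be checked empirically. The sketch below uses synthetic normal data (the location and scale are arbitrary assumptions, not values from the London dataset) and counts how many points fall within 1, 2 and 3 standard deviations of the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=0.3, size=100_000)  # synthetic normal data

mean, st_dev = x.mean(), x.std()

# Fraction of observations within k standard deviations of the mean
within = {k: float(np.mean(np.abs(x - mean) <= k * st_dev)) for k in (1, 2, 3)}
print(within)
```

For data that are not normally distributed, such as skewed consumption values, these shares can differ substantially, which is worth remembering when reading the histograms that follow.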

### Let's calculate the distribution and standard deviation for a sample: end user MAC000113, Energy Max

In [None]:
MAC000113_energy_max = london.loc[london['LCLid'] == 'MAC000113', 'energy_max']

In [None]:
import matplotlib.pyplot as plt

mean = MAC000113_energy_max.mean()
st_dev = MAC000113_energy_max.std()
MAC000113_energy_max.plot.hist(bins=20)
plt.axvline(mean, color = 'black', label = 'Mean')
plt.axvline(mean - st_dev, color = 'Red', label = '-1σ')
plt.axvline(mean + st_dev, color = 'Violet', label = '+1σ')
plt.axvline(mean - 2*st_dev, color = 'Red', label = '-2σ')
plt.axvline(mean + 2*st_dev, color = 'Violet', label = '+2σ')
plt.axvline(mean - 3*st_dev, color = 'Red', label = '-3σ')
plt.axvline(mean + 3*st_dev, color = 'Violet', label = '+3σ')
plt.xlabel('Energy')
plt.legend()

<div style="background-color:#ccffcc; padding:10px; border-radius:5px;">

### <span style="color:blue">Exercise 6</span>
Show the boxplot
    </div>

In [None]:
# Write your code here








### Let's check another example and distribution: energy max from population

In [None]:
import matplotlib.pyplot as plt

mean = london['energy_max'].mean()
st_dev = london['energy_max'].std()
london['energy_max'].plot.hist(bins=20)
plt.axvline(mean, color = 'Black', label = 'Mean')
plt.axvline(mean - st_dev, color = 'Red', label = '-1σ')
plt.axvline(mean + st_dev, color = 'Violet', label = '+1σ')
plt.axvline(mean - 2*st_dev, color = 'Red', label = '-2σ')
plt.axvline(mean + 2*st_dev, color = 'Violet', label = '+2σ')
plt.axvline(mean - 3*st_dev, color = 'Red', label = '-3σ')
plt.axvline(mean + 3*st_dev, color = 'Violet', label = '+3σ')
plt.xlabel('Energy')
plt.legend()

<div style="background-color:#ccffcc; padding:10px; border-radius:5px;">

### <span style="color:blue">Exercise 7</span>
Show the boxplot
    </div>

In [None]:
# write your code here








<div style="background-color:#ccffcc; padding:10px; border-radius:5px;">

### <span style="color:blue">Exercise 8</span>
    
Show the boxplots of all end-users (LCLid) for Energy Max.

 + Which end-user has the most variability?
 + Show the distribution of the end-user with the highest variability according to the boxplot
    </div>

In [None]:
# Your code here: BOXPLOT of all endusers for EnergyMax











In [None]:
# Your code here: DISTRIBUTION for EnergyMax for Enduser with more variability 













<div style="background-color:#ccffcc; padding:10px; border-radius:5px;">

### <span style="color:blue">Exercise 9</span>

For MAC005331, create two histograms in one figure: one for energy_mean on weekdays and another for energy_mean on weekends.
    </div>

In [None]:
# write your code here


