# Data Analysis

## What you'll learn in this course

As you learn data science, you will need to get up to speed in math and statistics. This course covers the fundamentals you need. In this course, you'll learn:

* Stats Jargon 
* Variables 
* Measure Center 
* Measure Variation 
* Distribution 
* Boxplots 


## Data Analysis introduction


### First notions: dataset, individuals, variables

Before we get into data science, or even analysis per se, we'll start with some vocabulary points and some definitions that will be important in everything that follows.


#### Dataset / Sample / Population


* **Population:** In the context of a statistical study, the population is the set of subjects that one would like to study or that are concerned by this study. For example, if a study on wages in France is conducted, the population represents the entire French working population. If we are trying to build a model that finds faces in photos on the web, the population is all the photos on the web.
* **Sample:** In practice, it is rarely possible to study the whole population, because of the time and cost of collecting data and, in a Big Data context, because of hardware limits on the volume of data that can be processed. Instead, a sample made up of a small number of representatives of the population is studied. Statistics and data science consist in using these specific examples to draw general conclusions about the population.
* **Dataset:** All the information collected on the individuals in the sample is gathered in what is called a dataset, a word that will be used throughout the course. A dataset consists of individuals or observations (the rows, also called lines), which are the representatives of the population present in the sample. It also consists of variables or features (the columns), which hold the characteristics that describe each observation.
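
The three notions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is invented (1,000 workers with random salaries), and the names `population`, `sample`, `dataset`, `worker_id` and `salary` are hypothetical.

```python
import random

# Hypothetical population: 1,000 workers with an invented yearly salary.
random.seed(42)  # fixed seed so the sketch is reproducible
population = [
    {"worker_id": i, "salary": round(random.uniform(18_000, 80_000), 2)}
    for i in range(1_000)
]

# A sample is a small subset of the population, drawn here at random
# and without replacement.
sample = random.sample(population, k=50)

# The dataset gathers what we know about the sampled individuals:
# each dict is one observation (row), each key is one variable (column).
dataset = sample
variables = list(dataset[0].keys())

print(len(population), "individuals in the population")
print(len(dataset), "observations in the dataset")
print("variables:", variables)
```

In practice a dataset is usually stored in a table-like structure, but a list of dictionaries is enough to show the row and column vocabulary.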


#### Variables

The study of variables is the main purpose of descriptive statistics. There are several types of variables that are described in the following table:


<table>
  <tr>
   <td colspan="4" >Variables
   </td>
  </tr>
  <tr>
   <td colspan="2" >Qualitative
   </td>
   <td colspan="2" >Quantitative
   </td>
  </tr>
  <tr>
   <td>Nominal
   </td>
   <td>Ordinal
   </td>
   <td>Continuous
   </td>
   <td>Discrete
   </td>
  </tr>
  <tr>
   <td>A variable that describes a characteristic that cannot be sorted from smallest to largest.
<p>
Example: colors, city names...
   </td>
   <td>A variable that describes a characteristic that takes only a limited number of categories (modalities) that can be ranked in ascending order.
<p>
Example: the star rating of an individual review of an Amazon product, a school mark graded from A to E
   </td>
   <td>A variable that describes a characteristic measured by a real number (in short, a number that can have decimals).
<p>
Example: A salary, an oxygen level measurement, a price
   </td>
   <td>A variable that describes a characteristic measured by an integer.
<p>
Example: an age measured in years, a number of visits to a web page
   </td>
  </tr>
</table>
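
As a rough illustration of the four variable types, here is a sketch that maps each of them to a Python type. The class and field names (`Color`, `Grade`, `Observation`, `favorite_color`, and so on) are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Color(Enum):
    """Nominal: categories with no meaningful order."""
    RED = "red"
    GREEN = "green"
    BLUE = "blue"


class Grade(IntEnum):
    """Ordinal: a limited set of categories that can be ranked (E < D < C < B < A)."""
    E = 1
    D = 2
    C = 3
    B = 4
    A = 5


@dataclass
class Observation:
    favorite_color: Color  # qualitative, nominal
    grade: Grade           # qualitative, ordinal
    salary: float          # quantitative, continuous
    page_visits: int       # quantitative, discrete


obs = Observation(Color.BLUE, Grade.B, 2350.75, 12)

# Ordinal values can be ranked; nominal values can only be tested for equality.
print(obs.grade > Grade.C)
print(obs.favorite_color == Color.BLUE)
print(sorted([Grade.A, Grade.E, Grade.C]))
```

Using `IntEnum` for the ordinal variable is a design choice of this sketch: it makes comparisons such as `Grade.B > Grade.C` valid, whereas comparing two members of the plain `Enum` with `<` raises a `TypeError`.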



## Univariate analysis

Univariate analysis is the study of one variable at a time. It can be done in several ways: computing statistics that describe the center and the variation of the variable at the sample level, or producing graphical visualizations.


### Measuring the center

Measuring the center describes what an observation at the center of the sample would look like, i.e. an observation that represents a typical example from the population. There are three common ways to measure the center of the distribution of a variable:


#### Average

The average (or mean) is the sum of all the measurements in your sample divided by the total number of individuals. The formula is as follows:

$$ \bar{X} = \frac{\sum^{n}_{i=1}X_i}{n}$$



Example: We have the following sample: Mary got 17/20 on an assignment, Roman got 14/20 and Michael got 15/20. On average, the three of them got:


$$\bar{X} = \frac{17 + 14 + 15}{3} \approx 15.33$$
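
The same calculation can be checked in Python. This is a minimal sketch using the three grades from the example; the variable names are illustrative.

```python
from statistics import mean

grades = [17, 14, 15]  # Mary, Roman, Michael

# Manual computation: sum of the values divided by their count.
manual_mean = sum(grades) / len(grades)

# Same computation with the standard library.
library_mean = mean(grades)

print(round(manual_mean, 2))   # 15.33
print(round(library_mean, 2))  # 15.33
```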





#### Median

The median is the middle value of your sample once it has been ordered from the smallest value to the largest. If you have an even number of individuals in your sample, take the average of the two middle values.

Example: Here is the distribution of salaries among 5 employees:

<table>
  <tr>
   <td>Alexis
   </td>
   <td>20 000$ / year
   </td>
  </tr>
  <tr>
   <td>Sarah
   </td>
   <td>22 000$ / year
   </td>
  </tr>
  <tr>
   <td>Jean-Claude
   </td>
   <td>23 000$ / year
   </td>
  </tr>
  <tr>
   <td>Mathilde
   </td>
   <td>40 000$ / year
   </td>
  </tr>
  <tr>
   <td>Bertrand
   </td>
   <td>70 000$ / year
   </td>
  </tr>
</table>


The median of this sample is 23 000$ / year: with five ordered values, it is the third one.
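
Here is a small sketch of the procedure, covering both the odd and the even case. The function name `manual_median` and the sixth salary of 90,000 are assumptions added for the illustration; the other values come from the table.

```python
from statistics import median

# Salaries from the table, deliberately unsorted.
salaries = [70_000, 20_000, 40_000, 23_000, 22_000]


def manual_median(values):
    """Order the values, then take the middle one (or the mean of the two middle ones)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


print(manual_median(salaries))  # 23000

# Hypothetical sixth employee: the count is now even, so the median is
# the average of the two middle values, 23000 and 40000.
print(manual_median(salaries + [90_000]))  # 31500.0

# The standard library gives the same results.
print(median(salaries), median(salaries + [90_000]))
```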


#### Mode

Finally, the mode is the value that appears most frequently in your sample. If no value repeats, the sample has no meaningful mode. If several values repeat, the mode is the one with the highest frequency. When several values tie for the highest frequency, the distribution is called multimodal. Software does not always handle this case well: many tools return only one of the modes.
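
The sketch below counts frequencies to find the mode or modes. The helper `modes` and the three small samples are hypothetical; `statistics.multimode` is the standard-library function that returns every mode.

```python
from collections import Counter
from statistics import multimode


def modes(values):
    """Return every value that reaches the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


unimodal = [1, 2, 2, 3, 2, 4]
bimodal = [1, 2, 2, 3, 3, 4]
no_repeat = [1, 2, 3, 4]

print(modes(unimodal))   # [2]
print(modes(bimodal))    # [2, 3]
print(modes(no_repeat))  # [1, 2, 3, 4]: every value ties, so the mode is not informative

# The standard library returns the same lists.
print(multimode(bimodal))
```

Note that `statistics.mode` (Python 3.8 and later) returns only the first mode it encounters, which is one example of a tool hiding a multimodal distribution.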



The mean is very sensitive to extreme values (outliers) in the sample, but it is simple to calculate and easy to update when new individuals are added to the dataset. The median is robust, i.e. it is not sensitive to possible outliers, but computing it requires all the data in the dataset. The mode is also robust, but determining it requires the full distribution of the variable, and it is not necessarily unique.
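
To make this comparison concrete, here is a sketch that reuses the five salaries from the median example. The outlier (an extra zero on the highest salary), the new salary of 29,000 and the helper `updated_mean` are assumptions made for the illustration.

```python
from statistics import mean, median

salaries = [20_000, 22_000, 23_000, 40_000, 70_000]
# Hypothetical data-entry error: the highest salary gets an extra zero.
with_outlier = salaries[:-1] + [700_000]

# The mean jumps from 35,000 to 161,000 while the median stays at 23,000.
print(f"clean:   mean={mean(salaries):,.0f}  median={median(salaries):,.0f}")
print(f"outlier: mean={mean(with_outlier):,.0f}  median={median(with_outlier):,.0f}")


def updated_mean(old_mean, old_count, new_value):
    """Update a mean with one new value, without revisiting the old data."""
    return old_mean + (new_value - old_mean) / (old_count + 1)


# Adding a new individual only requires the previous mean and count,
# whereas the median has to be recomputed from the whole dataset.
new_mean = updated_mean(mean(salaries), len(salaries), 29_000)
print(f"updated mean: {new_mean:,.0f}")
```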


### Measurement of the variation


#### Range


The simplest way to measure variation is with the range, which is the difference between the largest and the smallest value in your sample. For example, in 1993, the lowest temperature of the year in Paris was -5 degrees Celsius and the highest was 33 degrees. So the range is 38 degrees.


#### Variance

For a given variable, the variance is the mean of the squared deviations from the mean, computed over the observations in the dataset. The formula for calculating the variance is as follows:

$$V=\frac{\sum^{n}_{i=1}(x_i-\bar{x})^2}{n-1}$$


The formula above, with n - 1 in the denominator, is an unbiased estimator used in practice to estimate the variance of the population from a sample. The variance measures how widely dispersed the values of the variable are: it is a measure of the spread of the distribution. It is sensitive to extreme values.
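
As a sketch of the formula, the function below computes the sample variance step by step and compares it with the standard library. The grades are reused from the average example; the function name `sample_variance` is an assumption.

```python
from statistics import mean, pvariance, variance

grades = [17, 14, 15]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    m = mean(values)
    squared_deviations = [(x - m) ** 2 for x in values]
    return sum(squared_deviations) / (len(values) - 1)


print(round(sample_variance(grades), 4))  # 2.3333
print(round(variance(grades), 4))         # same estimator in the standard library

# Dividing by n instead of n - 1 gives the population variance,
# which is smaller on the same data.
print(round(pvariance(grades), 4))        # 1.5556
```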


#### Standard Deviation

More commonly used than the variance, the standard deviation is the square root of the variance. Its value is therefore on the same scale as the values taken by the variable, and it tells us how far the points in our sample typically deviate from the mean. The formula is as follows:

$$S = \sqrt{\frac{\sum^{n}_{i=1}(x_i - \bar{x})^2}{n-1}}$$
       
$$S = \sqrt{V}$$
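
A short sketch, again on the three grades from the average example, shows that the standard deviation is simply the square root of the sample variance:

```python
import math
from statistics import stdev, variance

grades = [17, 14, 15]

v = variance(grades)  # sample variance (n - 1 in the denominator)
s = math.sqrt(v)      # standard deviation, in the same unit as the grades

print(round(s, 4))              # 1.5275
print(round(stdev(grades), 4))  # same value from the standard library
```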
       

#### Mean absolute deviation

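The mean absolute deviation is the average of the absolute deviations from the mean. Like the standard deviation, it is expressed on the same scale as the variable:

$$\text{MAD} = \frac{\sum^{n}_{i=1}|x_i - \bar{x}|}{n}$$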
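
Here is a minimal sketch of the mean absolute deviation on the three grades used earlier. It assumes the usual convention of dividing by the number of observations n; the function name `mean_absolute_deviation` is hypothetical.

```python
from statistics import mean, stdev

grades = [17, 14, 15]


def mean_absolute_deviation(values):
    """Average of the absolute deviations from the mean (divided by n, by assumption)."""
    m = mean(values)
    return sum(abs(x - m) for x in values) / len(values)


mad = mean_absolute_deviation(grades)
print(round(mad, 4))            # 1.1111
print(round(stdev(grades), 4))  # the standard deviation, for comparison
```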


Although standard deviations are sensitive to statistical aberrations (i.e. exceptionally large or small values), they are not as affected as the range. The larger your dataset, the less your results will be affected by these extreme values. It is still recommended that you inspect all points that are abnormally far from "normal" values and remove them from your sample when you know that they only bias the calculations.
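
The sketch below illustrates both claims on invented data: a single outlier, ten times the largest value, is appended to a small and to a large sample, and the growth of the range and of the standard deviation is compared. The helper names and the data are assumptions.

```python
from statistics import stdev


def value_range(values):
    return max(values) - min(values)


def inflation(values, outlier):
    """Return by how much the range and the standard deviation are
    multiplied when a single outlier is appended to the values."""
    polluted = values + [outlier]
    range_ratio = value_range(polluted) / value_range(values)
    std_ratio = stdev(polluted) / stdev(values)
    return range_ratio, std_ratio


small = list(range(1, 101))     # 100 hypothetical observations
large = list(range(1, 10_001))  # 10,000 hypothetical observations

small_range_ratio, small_std_ratio = inflation(small, 10 * max(small))
large_range_ratio, large_std_ratio = inflation(large, 10 * max(large))

# The range is multiplied by about 10 in both cases, while the standard
# deviation grows less, and much less on the larger sample.
print(f"small: range x{small_range_ratio:.2f}, std x{small_std_ratio:.2f}")
print(f"large: range x{large_range_ratio:.2f}, std x{large_std_ratio:.2f}")
```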


Standard deviations can be calculated on any distribution. When the data is approximately normally distributed (bell-shaped), about 68% of the values in a sample lie within one standard deviation of the mean, about 95% within two standard deviations and about 99.7% within three standard deviations. This is known as the empirical rule. It is a mathematical property of the normal distribution, so it holds only approximately, or not at all, for data that follows another distribution.


![](https://drive.google.com/uc?export=view&id=1r8lPLa3bJhuS8e41BdPAlzRxuWTnM5w2)
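
These percentages can be checked by simulation. The sketch below draws one sample from a normal distribution and, for contrast, one from a skewed (exponential) distribution; the sample size, the seed and the helper `shares_within` are assumptions made for the illustration.

```python
import random
from statistics import fmean, stdev


def shares_within(values, ks=(1, 2, 3)):
    """Share of values lying within k standard deviations of the mean, for each k."""
    m, s = fmean(values), stdev(values)
    return {k: sum(abs(x - m) <= k * s for x in values) / len(values) for k in ks}


random.seed(0)  # fixed seed so the sketch is reproducible
normal_sample = [random.gauss(0, 1) for _ in range(20_000)]
skewed_sample = [random.expovariate(1.0) for _ in range(20_000)]

normal_shares = shares_within(normal_sample)
skewed_shares = shares_within(skewed_sample)

for k in (1, 2, 3):
    print(f"within {k} std: normal={normal_shares[k]:.3f}  skewed={skewed_shares[k]:.3f}")
```

On the normal sample the shares should land close to 68%, 95% and 99.7%. On the skewed sample the share within one standard deviation is noticeably higher (around 86% in theory), which shows that these percentages depend on the shape of the distribution.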


#### Z-Score

For each individual in your sample, you can calculate a Z-score. This is a measure that will allow you to determine how many standard deviations your individual is from the mean.
       
A Z-score beyond 2 in absolute value can be considered an extreme value (about 5% of the distribution).

$$Z = \frac{X-\bar{X}}{S}$$
       
$$Z = \frac{X-\bar{X}}{\sqrt{V}}$$


A Z-score is positive for all values above the mean and negative for all values below it. For example, a Z-score of 1.5 means that the individual lies 1.5 standard deviations above the mean.

By convention, values that are within plus or minus 2 times the standard deviation of the mean are considered ordinary. Others are considered uncommon, and values beyond three standard deviations from the mean are considered abnormal.  
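
The following sketch computes Z-scores and applies the convention above. The readings, the function names and the labels are hypothetical; the sample standard deviation (n - 1) is used, in line with the formulas in this section.

```python
from statistics import mean, stdev


def z_scores(values):
    """Number of standard deviations separating each value from the mean."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(x - m) / s for x in values]


def label(z):
    """Apply the convention: within 2 std is ordinary, beyond 3 std is abnormal."""
    if abs(z) <= 2:
        return "ordinary"
    if abs(z) <= 3:
        return "uncommon"
    return "abnormal"


# Hypothetical readings with one value far above the others.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 30]
scores = z_scores(readings)
labels = [label(z) for z in scores]

for value, z, name in zip(readings, scores, labels):
    print(f"{value:>3}  z={z:+.2f}  {name}")
```

Only the reading of 30 falls outside two standard deviations here. In a sample this small, a single extreme value also inflates the standard deviation, which limits how large its own Z-score can become.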


#### Box plot

Another way to look at a distribution is through quartiles, quintiles, deciles and percentiles, which split the ordered sample into groups holding 25%, 20%, 10% and 1% of the values respectively. Quartiles are usually displayed in a box plot (also called a box-and-whisker plot), which looks like this:


![](https://drive.google.com/uc?export=view&id=1ppWetyptkJPYNZcCVtvaYp1x-kgx_fV7)


To compute quartiles, you need to find three numbers (Q1, Q2 and Q3) that divide your ordered sample into four equal parts, so that 25% of your sample lies below Q1, 25% between Q1 and Q2, and so on. For example, in a sample of 100 people who were asked their salary:


* 25 earn less than 2000$ / month
* 25 earn between 2001 and 3000$ / month
* 25 earn between 3001 and 4000$ / month
* 25 earn more than 4000$ / month

Q2 is the median of your sample. The difference between Q3 and Q1 is called the interquartile range; it covers the middle 50% of the values in your sample. For example, the interquartile range of the sample above is $2000 ($4000 - $2000).
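
Quartiles and the interquartile range can be computed with the standard library (Python 3.8 or later). The nine monthly salaries below are invented for the illustration, and the `"inclusive"` method is one of several conventions for interpolating quartiles; other methods can give slightly different values.

```python
from statistics import median, quantiles

# Hypothetical monthly salaries.
monthly_salaries = [1500, 1800, 2100, 2400, 2700, 3000, 3300, 3600, 4200]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(monthly_salaries, n=4, method="inclusive")
iqr = q3 - q1

print("Q1:", q1)
print("Q2:", q2, "(median:", median(monthly_salaries), ")")
print("Q3:", q3)
print("interquartile range:", iqr)
```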

The bars at the ends of the whiskers mark the limits above and below which values are considered outliers. Box plots are therefore a good way to check whether a variable has outliers.
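
The text does not say where the whiskers end. A common convention, assumed here, places the limits at 1.5 times the interquartile range below Q1 and above Q3. The sketch below applies that rule to invented salaries; the function name `whisker_limits` and the factor `k` are hypothetical.

```python
from statistics import quantiles


def whisker_limits(values, k=1.5):
    """Lower and upper limits at k times the interquartile range (assumed convention)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical monthly salaries with one very high value.
monthly_salaries = [1500, 1800, 2100, 2400, 2700, 3000, 3300, 3600, 9000]

lower, upper = whisker_limits(monthly_salaries)
outliers = [x for x in monthly_salaries if x < lower or x > upper]

print("limits:", lower, upper)
print("outliers:", outliers)
```

Plotting libraries typically draw each whisker up to the most extreme data point that is still inside these limits and show the remaining points individually.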

