# Introduction to Exploratory Data Analysis

### Intro and objectives:
#### Review measures of location and variability
#### Review methods to explore the distribution of data

### In this lab you will learn:
1. How to compute estimates of location
2. How to compute estimates of variability



## 0. Let's import required libraries and load some data


In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
from scipy import stats

In [3]:
covidDataFrame = pd.read_csv('https://raw.githubusercontent.com/thousandoaks/Maths4DS101/main/data/covid19_cases.csv',parse_dates=['dateRep'])

In [4]:
covidDataFrame.head()

| | dateRep | day | month | year | cases | deaths | countriesAndTerritories | geoId | countryterritoryCode | popData2019 | continentExp | Cumulative_number_for_14_days_of_COVID-19_cases_per_100000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-09-19 | 19 | 9 | 2020 | 47 | 1 | Afghanistan | AF | AFG | 38041757.0 | Asia | 1.616645 |
| 1 | 2020-09-18 | 18 | 9 | 2020 | 0 | 0 | Afghanistan | AF | AFG | 38041757.0 | Asia | 1.535155 |
| 2 | 2020-09-17 | 17 | 9 | 2020 | 17 | 0 | Afghanistan | AF | AFG | 38041757.0 | Asia | 1.653446 |
| 3 | 2020-09-16 | 16 | 9 | 2020 | 40 | 10 | Afghanistan | AF | AFG | 38041757.0 | Asia | 1.708649 |
| 4 | 2020-09-15 | 15 | 9 | 2020 | 99 | 6 | Afghanistan | AF | AFG | 38041757.0 | Asia | 1.627159 |


### 1. Estimates of Location
#### Variables with measured or count data might have thousands of distinct values. A basic step in exploring your data is getting a “typical value” for each feature (variable): an estimate of where most of the data is located (i.e., its central tendency).

### 1.1. Mean (a.k.a Average)
#### The mean is simply the sum of all values divided by the number of values.

#### There are several ways to compute the mean of a variable in Python:
1. use the numpy module
2. use the pandas module


In [5]:
covidDataFrame['cases'].mean()

698.5782972688595

In [6]:
np.mean(covidDataFrame['cases'])

698.5782972688595
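
#### As a minimal sketch of the definition above, the mean can also be computed by hand and compared with the pandas and numpy results. The toy values below are hypothetical and are not taken from the COVID data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen only to keep the arithmetic easy to follow
values = pd.Series([1, 4, 4])

# Definition: sum of all values divided by the number of values
manual_mean = values.sum() / len(values)

pandas_mean = values.mean()   # pandas method
numpy_mean = np.mean(values)  # numpy function

print(manual_mean, pandas_mean, numpy_mean)
```

#### All three approaches should agree here, since 1 + 4 + 4 = 9 and 9 / 3 = 3.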

### 1.2. Trimmed Mean

#### A variation of the mean is the trimmed mean, which you calculate by dropping a fixed number of sorted values at each end and then averaging the remaining values. Representing the sorted values by $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$, where $x_{(1)}$ is the smallest value and $x_{(n)}$ the largest, the trimmed mean with the $p$ smallest and $p$ largest values omitted is:

$$\bar{x}_{\text{trimmed}} = \frac{\sum_{i=p+1}^{n-p} x_{(i)}}{n - 2p}$$

#### A trimmed mean eliminates the influence of extreme values. For example, in international diving the top and bottom scores from five judges are dropped, and the final score is the average of the three remaining scores. This makes it difficult for a single judge to manipulate the score, perhaps to favor their own country's contestant. Trimmed means are widely used and are often preferable to the ordinary mean.

In [7]:
# Trim 10% of the sorted values from each end (the 20% most extreme values in total)
stats.trim_mean(covidDataFrame['cases'], proportiontocut=0.1)

89.41374085086916
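
#### The diving example can be sketched in code. The five scores below are made up for illustration; the point is only that dropping one value at each end by hand should match `stats.trim_mean` when `proportiontocut` is set to the fraction removed from each end (here 1 of 5 values, so 0.2).

```python
import numpy as np
from scipy import stats

# Hypothetical scores from five judges (illustrative values only)
scores = np.array([7.0, 7.5, 8.0, 8.5, 9.5])

p = 1  # number of values dropped at EACH end of the sorted data
sorted_scores = np.sort(scores)

# Average of the values that remain after dropping the p smallest and p largest
manual_trimmed = sorted_scores[p:len(sorted_scores) - p].mean()

# proportiontocut is the fraction cut from each end, not the total fraction
scipy_trimmed = stats.trim_mean(scores, proportiontocut=0.2)

print(scores.mean())    # ordinary mean, uses all five scores
print(manual_trimmed)   # mean of 7.5, 8.0 and 8.5
print(scipy_trimmed)
```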

### 1.3. Median

#### The median is the middle number in a sorted list of the data. If there is an even number of data values, the middle value is not actually in the data set; it is the average of the two values that divide the sorted data into upper and lower halves. Compared to the mean, which uses all observations, the median depends only on the values in the center of the sorted data. While this might seem to be a disadvantage, since the mean is much more sensitive to the data, there are many instances in which the median is a better metric for location.

#### The median is referred to as a robust estimate of location since it is not influenced by outliers (extreme cases) that could skew the results. An outlier is any value that is very distant from the other values in a data set. The exact definition of an outlier is somewhat subjective, although certain conventions are used in various data summaries and plots.

#### Being an outlier does not in itself make a data value invalid or erroneous. Still, outliers are often the result of data errors such as mixing data of different units (kilometers versus meters) or bad readings from a sensor.

In [None]:
covidDataFrame['cases'].median()

9.0

In [None]:
np.median(covidDataFrame['cases'])

9.0
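
#### The two properties described above (averaging the two middle values, and robustness to outliers) can be checked on a small example. The numbers are hypothetical toy data, not values from the COVID data set.

```python
import numpy as np

# Hypothetical toy data with an even number of values
values = np.array([3, 1, 9, 7])

# Sorted: 1, 3, 7, 9. The two middle values are 3 and 7, so the median is their average.
even_median = np.median(values)
even_mean = np.mean(values)

# Append one extreme value and recompute both estimates
with_outlier = np.append(values, 1000)
outlier_median = np.median(with_outlier)
outlier_mean = np.mean(with_outlier)

print(even_mean, even_median)        # 5.0 5.0
print(outlier_mean, outlier_median)  # 204.0 7.0
```

#### A single extreme value moves the mean from 5 to 204, while the median only moves from 5 to 7. The same effect is visible in the `cases` column, where the mean (about 698.6) is far above the median (9.0).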

### 2. Estimates of Variability
#### Location is just one dimension in summarizing a feature. A second dimension, variability, also referred to as dispersion, measures whether the data values are tightly clustered or spread out. At the heart of statistics lies variability: measuring it, reducing it, distinguishing random from real variability, identifying the various sources of real variability, and making decisions in the presence of it.

### 2.1. Standard Deviation and Related Estimates
#### The most widely used estimates of variation are based on the differences, or deviations, between the estimate of location and the observed data. For a set of data {1, 4, 4}, the mean is 3 and the median is 4. The deviations from the mean are the differences: 1 - 3 = -2, 4 - 3 = 1, 4 - 3 = 1. These deviations tell us how dispersed the data is around the central value.

#### The best-known estimates for variability are the variance and the standard deviation, which are based on squared deviations. The variance is an average of the squared deviations, and the standard deviation is the square root of the variance.


#### There are several ways to compute the variance and standard deviation of a variable in Python:
1. use the numpy module
2. use the pandas module


In [8]:
covidDataFrame['cases'].var()

18942995.121199723

In [9]:
np.var(covidDataFrame['cases'])

18942561.821526334

In [10]:
covidDataFrame['cases'].std()

4352.3551235164305

In [11]:
np.std(covidDataFrame['cases'])

4352.305345621598
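
#### The pandas and numpy results above differ slightly. This is consistent with the libraries' different defaults: pandas divides the sum of squared deviations by n - 1 (`ddof=1`, the sample variance), while numpy divides by n (`ddof=0`, the population variance). The sketch below illustrates this with the {1, 4, 4} example from the text.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, 4, 4])          # the small example used in the text
array = data.to_numpy()

deviations = array - array.mean()    # -2, 1, 1
squared_sum = (deviations ** 2).sum()  # 4 + 1 + 1 = 6

n = len(array)
population_var = squared_sum / n     # divide by n
sample_var = squared_sum / (n - 1)   # divide by n - 1

print(np.var(array), population_var)          # numpy default, ddof=0
print(data.var(), sample_var)                 # pandas default, ddof=1
print(np.var(array, ddof=1), data.var(ddof=0))  # defaults swapped explicitly
```

#### Passing `ddof` explicitly makes the two libraries agree. The standard deviation follows the same convention, so `np.std(x, ddof=1)` should match `Series.std()`.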

### 3. Skewness and Kurtosis
#### A fundamental task in many statistical analyses is to characterize the location and variability of a data set. A further characterization of the data includes skewness and kurtosis.
#### Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.

#### Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails. Data sets with low kurtosis tend to have light tails.

In [None]:
covidDataFrame['cases'].skew()

12.440644089924893

In [None]:
sp.stats.skew(covidDataFrame['cases'])

12.440217237278587

In [None]:
covidDataFrame['cases'].kurtosis()

183.35891066556854

In [None]:
sp.stats.kurtosis(covidDataFrame['cases'])

183.33780345440454
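
#### The small differences between the pandas and scipy results above are consistent with a bias correction: pandas reports bias-corrected estimates, while `stats.skew` and `stats.kurtosis` default to `bias=True`. Both libraries report excess kurtosis by default, so a normal distribution scores 0. The sketch below uses a hypothetical random sample (not the COVID data) to show that `bias=False` brings scipy in line with pandas.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical right-skewed sample, generated only for illustration
rng = np.random.default_rng(0)
sample = pd.Series(rng.exponential(size=200))
array = sample.to_numpy()

pandas_skew = sample.skew()                       # bias-corrected
scipy_skew_default = stats.skew(array)            # bias=True by default
scipy_skew_corrected = stats.skew(array, bias=False)

pandas_kurt = sample.kurtosis()                   # bias-corrected excess kurtosis
scipy_kurt_default = stats.kurtosis(array)        # bias=True, fisher=True by default
scipy_kurt_corrected = stats.kurtosis(array, bias=False)

print(pandas_skew, scipy_skew_default, scipy_skew_corrected)
print(pandas_kurt, scipy_kurt_default, scipy_kurt_corrected)
```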

#### Pandas provides the method DataFrame.describe(), which is a convenient way to compute several location and variability estimates at once.

In [14]:
covidDataFrame.describe()

| | day | month | year | cases | deaths | popData2019 | Cumulative_number_for_14_days_of_COVID-19_cases_per_100000 |
|---|---|---|---|---|---|---|---|
| count | 43718.0 | 43718.0 | 43718.0 | 43718.0 | 43718.0 | 43654.0 | 40937.0 |
| mean | 15.646919 | 5.61899 | 2019.998467 | 698.578297 | 21.792488 | 42870540.0 | 33.001167 |
| std | 8.776722 | 2.206138 | 0.039118 | 4352.355124 | 126.490919 | 157872000.0 | 76.067751 |
| min | 1.0 | 1.0 | 2019.0 | -8261.0 | -1918.0 | 815.0 | -147.419587 |
| 25% | 8.0 | 4.0 | 2020.0 | 0.0 | 0.0 | 1355982.0 | 0.370634 |
| 50% | 16.0 | 6.0 | 2020.0 | 9.0 | 0.0 | 8082359.0 | 4.571738 |
| 75% | 23.0 | 7.0 | 2020.0 | 150.0 | 3.0 | 29161920.0 | 26.575105 |
| max | 31.0 | 12.0 | 2020.0 | 97894.0 | 4928.0 | 1433784000.0 | 1058.225943 |
