# Univariate and Bivariate Analysis Lab

This data was created <b>based on the following dataset</b>: https://www.kaggle.com/mustafaali96/weight-height

The variables used are:
* Money
* Height
* Weight
* Gender


## Context

We'll use the weight-height dataset to gain a deeper understanding of some important variable distributions.

We'll learn how to:
- get a first impression of a variable's distribution just from the summary statistics of the dataset.
- understand how normal and non-normal distributions affect outlier analysis.
- understand how one variable can affect the distribution of another.

## Import libraries 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Read dataset from `weight-height-money.csv`.

Take a look at the first rows of the dataset.

In [5]:
# your answer here
data = pd.read_csv('./data/weight-height-money.csv')
data.head()

## Use the describe method to understand the data.

In [6]:
# your answer here
data.describe()

| | Height | Weight | Money |
|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 |
| mean | 66.36756 | 161.440357 | 631.1565 |
| std | 3.847528 | 32.108439 | 806.0481 |
| min | 54.263133 | 64.700127 | 1.735664e-07 |
| 25% | 63.50562 | 135.818051 | 89.0672 |
| 50% | 66.31807 | 161.212928 | 339.0872 |
| 75% | 69.174262 | 187.169525 | 851.5404 |
| max | 78.998742 | 269.989699 | 8168.216 |


## What can you observe? Compare mean and median values for each variable of our dataset.

In [0]:
# your answer here
'''The mean and the median are very close for Height and Weight, but they differ widely for Money'''

## From that observation, what can you conclude? Can you imagine which kind of distribution each one has?

Try to predict which category each variable most likely seems to fall into (without plotting it yet):
- Left skewed
- Right skewed
- Gaussian-like

_hint: Remember the effect of outliers on the mean and median values_

In [0]:
# your answer here
'''
Height and weight are gaussian-like
Money is right skewed (its mean is well above its median)
'''

Usually, comparing the mean and the median already gives meaningful insight into a variable's distribution. If the mean and median are close, you can suppose that most of the data is concentrated around the mean. However, if they are far apart, you can suppose that some extreme observations are pulling the mean toward them.
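A minimal sketch of this effect, using synthetic samples generated only for illustration (they are not the lab's dataset): a symmetric sample keeps its mean and median close, while a long right tail pulls the mean above the median.

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric, Gaussian-like sample: mean and median nearly coincide.
symmetric = rng.normal(loc=66, scale=4, size=10_000)

# Right-skewed sample: a few very large values pull the mean upward.
skewed = rng.exponential(scale=600, size=10_000)

for name, sample in [("symmetric", symmetric), ("skewed", skewed)]:
    print(f"{name:>9}: mean={sample.mean():.1f}  median={np.median(sample):.1f}")
```

The same pattern appears in the `describe()` output: for `Money` the mean (about 631) is far above the median (about 339).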

# Univariate Analysis

## Gender count

Count how many males and females there are in this dataset using pandas.

In [7]:
# your answer here
data.groupby('Gender').count()

| Gender | Height | Weight | Money |
|---|---|---|---|
| Female | 5000 | 5000 | 5000 |
| Male | 5000 | 5000 | 5000 |


## Visual gender count

Use seaborn (sns) to visualize how many males and females there are in the dataset.

_hint: If you don't know how to do this, you can google: seaborn + the pandas method you used on the previous exercise_

In [0]:
# your answer here
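
One possible approach, sketched on a tiny hypothetical dataframe (`toy` is an illustrative name, not the lab's `data`): `sns.countplot` draws one bar per category, matching the counts pandas reports.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the lab's dataframe (only Gender matters here).
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male", "Female"]})

counts = toy["Gender"].value_counts()      # numeric counts per category
ax = sns.countplot(data=toy, x="Gender")   # the same counts drawn as bars
ax.set_title("Rows per gender")
```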


## Consider only Height

Create a pandas series of the Height variable

In [0]:
# your answer here

height = 

### Histogram-plot

Plot the histogram of the height

In [0]:
plt.figure(figsize=(12, 8))
# your answer here

### Box-plot

Plot the boxplot of the height

In [0]:
plt.figure(figsize=(12, 8))



What do you think the distribution of height is like? Do you think it is common for variables to behave like that in real life?

### KDE distribution for height

Plot the KDE (kernel density estimate) of the height, without the histogram.
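
The three views of a single variable can be compared side by side. The sketch below assumes seaborn 0.11 or newer (for `histplot`) and uses synthetic heights rather than the lab's data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic heights in inches; NOT the lab's data.
rng = np.random.default_rng(1)
height = pd.Series(rng.normal(loc=66, scale=4, size=1_000), name="Height")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.histplot(x=height, ax=axes[0])   # counts per bin
sns.boxplot(x=height, ax=axes[1])    # median, quartiles, points beyond whiskers
sns.kdeplot(x=height, ax=axes[2])    # smoothed density, no histogram bars
for ax, title in zip(axes, ["Histogram", "Box plot", "KDE"]):
    ax.set_title(title)
fig.tight_layout()
```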

### Analysis

As we can see, most heights fall in the range from about 60 to 75. How many people fall more than 3 standard deviations away from the mean? Can you consider them outliers? Why?

#### Calculate the mean

In [0]:
# your answer here

#### Calculate the standard deviation 

In [0]:
# your answer here

#### Calculate the values for the cutoffs:

`upper_cutoff = mean` <b><span style="color:red">+</span></b> `3 * standard_deviation` 

`lower_cutoff = mean` <b><span style="color:red">-</span></b> `3 * standard_deviation`

#### Now filter the original dataset. 

Use the values you calculated above to filter the original dataset. 

You should obtain a filtered dataset containing only the rows in which the `Height` column is greater than the upper cutoff or lower than the lower cutoff.
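
The cutoff-and-filter steps can be sketched as follows. The dataframe `df` is a synthetic stand-in with a single `Height` column, so the row counts will not match the lab's expected results.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Height column; NOT the lab's data.
rng = np.random.default_rng(2)
df = pd.DataFrame({"Height": rng.normal(loc=66, scale=4, size=10_000)})

mean = df["Height"].mean()
std = df["Height"].std()          # pandas uses the sample std (ddof=1)

upper_cutoff = mean + 3 * std
lower_cutoff = mean - 3 * std

# A row cannot be above the upper cutoff AND below the lower one,
# so the two conditions are combined with OR (|).
outside = df[(df["Height"] > upper_cutoff) | (df["Height"] < lower_cutoff)]
print(outside.shape[0], "of", df.shape[0], "rows fall outside the cutoffs")
```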

In [0]:
# expected results:

| | Gender | Height | Weight |
|---|---|---|---|
| 994 | Male | 78.095867 | 255.690835 |
| 1317 | Male | 78.462053 | 227.342565 |
| 2014 | Male | 78.998742 | 269.989699 |
| 3285 | Male | 78.52821 | 253.889004 |
| 3757 | Male | 78.621374 | 245.733783 |
| 6624 | Female | 54.616858 | 71.393749 |
| 9285 | Female | 54.263133 | 64.700127 |


#### Finally, calculate the shape of this filtered dataset and compare with the original dataframe.

What percentage of the rows fell outside these thresholds? Did you expect this value? Why?
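
As a reference point, for an exactly normal variable the share of values more than 3 standard deviations from the mean is 1 - erf(3/√2), roughly 0.27%. The sketch below compares that figure with the 7 rows out of 10,000 listed in the expected result above; `normal_tail_fraction` is a hypothetical helper name introduced only for this illustration.

```python
import math

def normal_tail_fraction(k: float) -> float:
    """Two-sided share of a normal distribution beyond k standard deviations."""
    return 1.0 - math.erf(k / math.sqrt(2.0))

expected = normal_tail_fraction(3)   # theoretical share under exact normality
observed = 7 / 10_000                # 7 rows in the expected result above

print(f"expected under normality: {expected:.4%}")
print(f"observed in the table:    {observed:.4%}")
```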

## Now perform the same analysis for `money` variable.

You'll do exactly the same analysis for another variable of the dataset, one that behaves quite differently. Let's try to understand how.

### Considering only money

Generate a variable called money from the dataset

In [0]:
# your code here
money = 

In [0]:
money.shape

(10000,)

### Histogram-plot

In [0]:
# your code here

### Box-plot

In [0]:
# your code here

### KDE distribution for money

In [0]:
# your code here

### Analysis

Again, how many people fall more than 3 standard deviations away from the mean on either side (what percentage of the dataset)? Let's do it step by step:

#### Calculate the mean

In [0]:
# your answer here

#### Calculate the standard deviation 

In [0]:
# your answer here.


#### Calculate the values for the cutoffs:

`upper_cutoff = mean` <b><span style="color:red">+</span></b> `3 * standard_deviation` 

`lower_cutoff = mean` <b><span style="color:red">-</span></b> `3 * standard_deviation`

#### Again, filter the original dataset. 

In [0]:
# your answer here

#### Finally, calculate the shape of this filtered dataset and compare with the original dataframe.

What percentage of the rows fell outside these thresholds? Did you expect this value?

In [0]:
# your answer here

Can you consider them outliers?

In [0]:
# your answer here

By now, you should have observed that in order to consider an observation an `outlier`, one needs to take into account the **distribution** of the variable. In fact, most statistical aspects do not mean anything until you understand the variable's distribution.
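
To illustrate, here is a sketch applying the same 3-standard-deviation rule to two synthetic samples (neither is the lab's data): one Gaussian and one right-skewed. For a skewed, non-negative variable the lower cutoff becomes negative, so every flagged point sits in the long right tail. The `describe()` output for `Money` shows the same thing: 631.16 - 3 × 806.05 is about -1787, far below the minimum.

```python
import numpy as np

rng = np.random.default_rng(3)
samples = {
    "gaussian": rng.normal(loc=66, scale=4, size=100_000),
    "right-skewed": rng.exponential(scale=600, size=100_000),
}

cutoffs, fractions = {}, {}
for name, x in samples.items():
    lower = x.mean() - 3 * x.std()
    upper = x.mean() + 3 * x.std()
    cutoffs[name] = (lower, upper)
    # Share of values lying outside the two cutoffs.
    fractions[name] = np.mean((x < lower) | (x > upper))
    print(f"{name:>12}: lower={lower:9.1f}  upper={upper:9.1f}  "
          f"outside={fractions[name]:.2%}")
```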

# Bivariate Analysis

## Considering both height and weight

Now we'll consider both height and weight variables to understand our data. We'll perform what is called a bivariate analysis.

### Perform a scatterplot to check the relation between Height and Weight

In [0]:
plt.figure(figsize=(12, 8))



### Do the same plot, but color the markers by Gender

In [0]:
plt.figure(figsize=(12, 8))
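
One way to color markers by a categorical column is the `hue` argument of `sns.scatterplot`. The sketch below uses a synthetic two-group dataframe with a made-up linear relation between height and weight, so it only illustrates the call, not the lab's result.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic two-group data; NOT the lab's dataset.
rng = np.random.default_rng(4)
n = 200
gender = np.repeat(["Male", "Female"], n)
height = np.concatenate([rng.normal(69, 3, n), rng.normal(64, 3, n)])
weight = 5 * height - 170 + rng.normal(0, 10, 2 * n)   # made-up relation plus noise
toy = pd.DataFrame({"Gender": gender, "Height": height, "Weight": weight})

# hue assigns one color per Gender value and adds a legend.
ax = sns.scatterplot(data=toy, x="Height", y="Weight", hue="Gender", alpha=0.6)
ax.set_title("Weight vs Height, colored by Gender")
```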


### What insights did coloring the plot by `Gender` give you?

In [0]:
# your answer here

### Create a variable called `gender_groupby` to group data by 'Gender'. However, don't define any aggregations yet. 

Just perform the groupby operation.

In [0]:
# your code here
gender_groupby = 

#### Run `gender_groupby.head()` to check the groups obtained.

In [0]:
gender_groupby.head()

#### Run gender_groupby.describe() to check the statistics for each group. 

_hint: You can transpose this result to obtain a better visualization of the results_


In [0]:
gender_groupby.describe()

The results above should give you some insight into the effect of gender on your dataset, both visually and numerically.
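
The groupby steps can be sketched on a tiny hypothetical dataframe that borrows the lab's column names (the values are invented for illustration):

```python
import pandas as pd

# Tiny hypothetical dataframe with the lab's column names.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Height": [70.0, 64.0, 72.0, 62.0],
    "Weight": [180.0, 130.0, 200.0, 120.0],
})

gender_groupby = toy.groupby("Gender")    # lazy: no aggregation computed yet

first_rows = gender_groupby.head(1)       # first row of each group, original order
summary = gender_groupby.describe().T     # transposed: one column per gender
print(summary)
```

Transposing puts each statistic on its own row, which makes the two genders easy to compare column by column.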

## Distribution plots

### Verifying the distribution of your variables for each Gender.

We have seen that each variable in our dataset has its own "DNA": its distribution. However, the story does not stop there. Each variable has peculiarities within it, and it is our job as data analysts to discover them. We'll see later that what `machine learning models` mostly do for us is automate this process (if we understand them, of course).

In this case, specifically, we want to understand the effect of our `Gender` variable on the distribution of our dataset.


### First, plot the distribution of the `Height` variable again.

In [0]:
# your answer here

### Now, filter your dataset for each gender. Create a dataframe called `men` and another called `women` and plot the `Height` distribution for each of them in the same plot.

In [0]:
# your answer here

### What insights could you observe from that? What is the impact of `Gender` on `Height`?

**Extra Note:** This issue, https://github.com/mwaskom/seaborn/issues/861, discusses why `distplot` doesn't have a `hue` argument and how to work around it (look for `FacetGrid`).

Try to do the same for the `Money` variable. What is the impact of `Gender` on `Money`?

_Hint: for the Money variable, try specifying `hist=False`_


## Boxplot 

### Gender vs Height

Plot the boxplot considering the x-axis as `Gender` and y-axis as `Height`

In [0]:
plt.figure(figsize=(12, 8))



### Gender vs Money

In [0]:
plt.figure(figsize=(12, 8))


From the conclusions of the previous exercises, did you expect the boxplots to look like the ones above?

In [0]:
# your answer here

### Multivariate Analysis

Use `sns.pairplot` to see some of the combinations explored so far. Use `hue='Gender'`.

Note that in a real problem, a pairplot gets messy quickly, since a dataset can have a large number of variables. Use it wisely.
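
One way to keep the grid readable is to restrict it to a few columns with the `vars` argument. A sketch on synthetic data (the column names mirror the lab's, the values are invented):

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with the lab's column names; NOT the lab's dataset.
rng = np.random.default_rng(6)
n = 150
toy = pd.DataFrame({
    "Gender": np.repeat(["Male", "Female"], n),
    "Height": np.concatenate([rng.normal(69, 3, n), rng.normal(64, 3, n)]),
    "Money": rng.exponential(scale=600, size=2 * n),
})
toy["Weight"] = 5 * toy["Height"] - 170 + rng.normal(0, 10, 2 * n)

# `vars` limits the grid to the listed columns; Money is left out here.
grid = sns.pairplot(toy, hue="Gender", vars=["Height", "Weight"])
```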

People often just plot this graph and don't draw any conclusions from it. Don't fall into that trap.