# Exploratory Data Analysis - A brief introduction

In this tutorial you will familiarize yourself with general Python tools for exploratory data analysis. We have prepared examples of the most common data manipulation and plotting tasks. Each new concept is followed by a few exercises so that you can practise what you have just learned.


The exercises are annotated as follows:
---------------------------------------

### Exercise

text of the task

In [None]:
# your code here

---------------------------------------

Write your code based on what you have just learned or, if you know another way, feel free to be creative. Every task can be solved by scrolling back up through the tutorial and reading carefully, but you are welcome to search online or consult the literature provided if you want to build something extra. Let's start our Exploratory Data Analysis (EDA) journey!


# Getting Started with EDA


## Information

One of the most popular software stacks for data science is __PyData__, a collection of packages and tools within the Python ecosystem. Python is one of the most popular languages for data science, largely thanks to its excellent developer ecosystem and its variety of powerful open-source tools and packages.

![](https://i.imgur.com/2yfeMP0.png)

The most popular packages in the PyData stack, which you will use throughout this course and in your career as a data scientist, include the following:

- NumPy: A popular library for fast numeric computations on arrays (vectors and matrices) in Python.
- SciPy: A popular scientific computing library with many mathematical and statistical functions, including distributions, statistical tests, integration, differentiation and optimization, which can be used with NumPy objects.
- Pandas: Builds on NumPy and provides tabular data structures like dataframes which can be used to work with tabular data.
- Scikit-learn: The most popular Python machine learning library, includes many popular models to perform predictive analytics, modeling, diagnostics as well as data processing and wrangling.
- StatsModels: Similar to scikit-learn, aimed at performing descriptive and inferential statistics.
- Matplotlib: A popular generic plotting library including histograms, line, bar, and scatter plots. Can also be accessed through pandas.
- Seaborn: A visualization library building on matplotlib, including plotting of distributions and heatmaps with better aesthetics and visuals.


## Installation and Loading

The easiest option is to install [Anaconda](https://www.anaconda.com/products/individual), a Python distribution for your computer's operating system that comes with all the necessary packages pre-installed.

If you only want to install specific packages, you can use `pip install <package_name>` or `conda install <package_name>`.

Once the packages are installed, it is time to load them. Let's start with "pandas".

In [None]:
import pandas as pd
%matplotlib inline

Now we are ready to explore the data.

# Load the data 

We start with loading the data and then analyze it with `pandas` and Python to subset, slice, sort, group and visualize!

In this tutorial, we'll be exploring the 'Star Wars' dataset, where each observation is a character and each variable is a feature such as height, mass or hair_color. 

To check out the first several observations of your dataframe, we first need to read in the file. For this we use the `pandas` [`read_csv`](https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.read_csv.html) function. We always recommend checking the documentation which usually has a wealth of information available about specific functions and their roles. 

In [None]:
df = pd.read_csv('data/pw_stats_01_star_wars.csv')

To see the type of the variable you just created, you can use the `type` function as shown below:

In [None]:
type(df)

pandas.core.frame.DataFrame

To view your data, you can simply type the variable name:

In [None]:
df

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species
0,Luke Skywalker,172.0,77.0,blond,fair,blue,19BBY,male,Tatooine,Human
1,C-3PO,167.0,75.0,,gold,yellow,112BBY,,Tatooine,Droid
2,R2-D2,96.0,32.0,,"white, blue",red,33BBY,,Naboo,Droid
3,Darth Vader,202.0,136.0,none,white,yellow,41.9BBY,male,Tatooine,Human
4,Leia Organa,150.0,49.0,brown,light,brown,19BBY,female,Alderaan,Human
...,...,...,...,...,...,...,...,...,...,...
82,Rey,,,brown,light,hazel,,female,,Human
83,Poe Dameron,,,brown,light,brown,,male,,Human
84,BB8,,,none,none,black,,none,,Droid
85,Captain Phasma,,,,,,,female,,


Note: the output of this operation is quite big. 

To view only the first few rows of the dataframe, you can use the `head()` method:

In [None]:
df.head()

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species
0,Luke Skywalker,172.0,77.0,blond,fair,blue,19BBY,male,Tatooine,Human
1,C-3PO,167.0,75.0,,gold,yellow,112BBY,,Tatooine,Droid
2,R2-D2,96.0,32.0,,"white, blue",red,33BBY,,Naboo,Droid
3,Darth Vader,202.0,136.0,none,white,yellow,41.9BBY,male,Tatooine,Human
4,Leia Organa,150.0,49.0,brown,light,brown,19BBY,female,Alderaan,Human


Similarly, to view the last few rows you can use the `tail()` method:

In [None]:
df.tail()

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species
82,Rey,,,brown,light,hazel,,female,,Human
83,Poe Dameron,,,brown,light,brown,,male,,Human
84,BB8,,,none,none,black,,none,,Droid
85,Captain Phasma,,,,,,,female,,
86,Padmé Amidala,165.0,45.0,brown,light,brown,46BBY,female,Naboo,Human


For a quick check on the dimensions of the dataframe, you can use the `shape` attribute:

In [None]:
df.shape

(87, 10)

Looks like we have 87 rows and 10 columns in our dataset.

You can also get detailed information about each column of the dataframe using the `info()` method:

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   name        87 non-null     object 
 1   height      81 non-null     float64
 2   mass        59 non-null     float64
 3   hair_color  81 non-null     object 
 4   skin_color  85 non-null     object 
 5   eye_color   84 non-null     object 
 6   birth_year  43 non-null     object 
 7   gender      84 non-null     object 
 8   homeworld   77 non-null     object 
 9   species     82 non-null     object 
dtypes: float64(2), object(8)
memory usage: 6.9+ KB


Looks like there are quite a number of columns with missing values!
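
To put a number on the missing values, one option is to count them per column. The snippet below is a minimal sketch on a few rows copied from the table above; the name `sample` is only used for illustration and is not part of the tutorial's dataset loading.

```python
import numpy as np
import pandas as pd

# A few rows copied from the dataset shown above; np.nan marks a missing value.
sample = pd.DataFrame({
    'name': ['Luke Skywalker', 'C-3PO', 'Rey', 'Captain Phasma'],
    'height': [172.0, 167.0, np.nan, np.nan],
    'hair_color': ['blond', np.nan, 'brown', np.nan],
})

# isna() returns a True/False table of the same shape;
# summing each column counts how many values are missing.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

On the full dataframe the same idea would be `df.isna().sum()`, which complements the non-null counts reported by `info()`.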

# Data Wrangling with "pandas"

Now it's time to explore your data and get some initial insight into the dataset. You'll be using __`pandas`__ functions and operations in this section.

Let's assume that you want to choose a particular set of observations, say, those for which the "species" was 'Droid'. 

## To select rows based on condition

Select all rows where species is Droid:

In [None]:
df['species'] == 'Droid'

0     False
1      True
2      True
3     False
4     False
      ...  
82    False
83    False
84     True
85    False
86    False
Name: species, Length: 87, dtype: bool

The above condition shows us for which rows the `species` column has the value 'Droid' (identified by `True`) and for which rows it has some other value (identified by `False`). We can then use this condition as a filter to show only the rows where `species` is `Droid`:

In [None]:
df[df['species'] == 'Droid']

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species
1,C-3PO,167.0,75.0,,gold,yellow,112BBY,,Tatooine,Droid
2,R2-D2,96.0,32.0,,"white, blue",red,33BBY,,Naboo,Droid
7,R5-D4,97.0,32.0,,"white, red",red,,,Tatooine,Droid
21,IG-88,200.0,140.0,none,metal,red,15BBY,none,,Droid
84,BB8,,,none,none,black,,none,,Droid


Now select all rows where species is not Droid:

In [None]:
df[df['species'] != 'Droid']

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species
0,Luke Skywalker,172.0,77.0,blond,fair,blue,19BBY,male,Tatooine,Human
3,Darth Vader,202.0,136.0,none,white,yellow,41.9BBY,male,Tatooine,Human
4,Leia Organa,150.0,49.0,brown,light,brown,19BBY,female,Alderaan,Human
5,Owen Lars,178.0,120.0,"brown, grey",light,blue,52BBY,male,Tatooine,Human
6,Beru Whitesun lars,165.0,75.0,brown,light,blue,47BBY,female,Tatooine,Human
...,...,...,...,...,...,...,...,...,...,...
81,Finn,,,black,dark,dark,,male,,Human
82,Rey,,,brown,light,hazel,,female,,Human
83,Poe Dameron,,,brown,light,brown,,male,,Human
85,Captain Phasma,,,,,,,female,,


The above can also be done in another way using the `~` operator as follows.

In [None]:
df[~(df['species'] == 'Droid')]

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species
0,Luke Skywalker,172.0,77.0,blond,fair,blue,19BBY,male,Tatooine,Human
3,Darth Vader,202.0,136.0,none,white,yellow,41.9BBY,male,Tatooine,Human
4,Leia Organa,150.0,49.0,brown,light,brown,19BBY,female,Alderaan,Human
5,Owen Lars,178.0,120.0,"brown, grey",light,blue,52BBY,male,Tatooine,Human
6,Beru Whitesun lars,165.0,75.0,brown,light,blue,47BBY,female,Tatooine,Human
...,...,...,...,...,...,...,...,...,...,...
81,Finn,,,black,dark,dark,,male,,Human
82,Rey,,,brown,light,hazel,,female,,Human
83,Poe Dameron,,,brown,light,brown,,male,,Human
85,Captain Phasma,,,,,,,female,,
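
Notice that Captain Phasma, whose species is missing, appears in the "not Droid" results. A missing value is never equal to `'Droid'`, so `!=` evaluates to `True` for it. The sketch below reproduces this on a small hand-built frame (`sample` is an illustrative name) and shows one possible way to keep only rows with a known species.

```python
import numpy as np
import pandas as pd

# Four rows taken from the dataset; Captain Phasma has no species recorded.
sample = pd.DataFrame({
    'name': ['Luke Skywalker', 'C-3PO', 'BB8', 'Captain Phasma'],
    'species': ['Human', 'Droid', 'Droid', np.nan],
})

# A missing value is not equal to 'Droid', so it passes the != filter.
not_droid = sample[sample['species'] != 'Droid']

# Combine two conditions with & to also require a known species.
# Each condition needs its own parentheses.
known_not_droid = sample[(sample['species'] != 'Droid') & (sample['species'].notna())]

print(not_droid['name'].tolist())
print(known_not_droid['name'].tolist())
```

Whether rows with an unknown species belong in the result depends on the question you are asking, so it is worth deciding this explicitly.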


### Exercise 

Select all rows where the species is Human

In [None]:
#your code here


## To select rows based on positions

You can use the `iloc` indexer to select specific rows by their integer position (counting from 0).

Select the first three rows of the dataframe

In [None]:
df.iloc[0:3, :]

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species
0,Luke Skywalker,172.0,77.0,blond,fair,blue,19BBY,male,Tatooine,Human
1,C-3PO,167.0,75.0,,gold,yellow,112BBY,,Tatooine,Droid
2,R2-D2,96.0,32.0,,"white, blue",red,33BBY,,Naboo,Droid


Select rows 10, 15 and 17 of the dataframe

In [None]:
df.iloc[[10, 15, 17], :]

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species
10,Anakin Skywalker,188.0,84.0,blond,fair,blue,41.9BBY,male,Tatooine,Human
15,Jabba Desilijic Tiure,175.0,1358.0,,"green-tan, brown",orange,600BBY,hermaphrodite,Nal Hutta,Hutt
17,Jek Tono Porkins,180.0,110.0,brown,fair,blue,,male,Bestine IV,Human
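
In the examples above the positions happen to match the index labels, because the dataframe has the default index 0, 1, 2, and so on. After filtering, this is no longer the case. The following sketch (on a small illustrative frame named `sample`) contrasts position-based `iloc` with label-based `loc`.

```python
import pandas as pd

sample = pd.DataFrame({
    'name': ['Luke Skywalker', 'C-3PO', 'R2-D2', 'Darth Vader'],
    'species': ['Human', 'Droid', 'Droid', 'Human'],
})

# Filtering keeps the original index labels: here 0 and 3.
humans = sample[sample['species'] == 'Human']

by_position = humans.iloc[1]  # second row of the filtered frame
by_label = humans.loc[3]      # row whose index label is 3

print(by_position['name'], '|', by_label['name'])
```

Both lookups return the same row here, but `humans.iloc[3]` would raise an `IndexError` because the filtered frame only has two rows.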


### Exercise 

Select rows 70 and 75 of the dataframe and display them

In [None]:
#your code here


## To select columns based on position

Select columns by their integer position using `iloc`

Show the first three columns of the dataframe

In [None]:
df.iloc[:, 0:3]

Unnamed: 0,name,height,mass
0,Luke Skywalker,172.0,77.0
1,C-3PO,167.0,75.0
2,R2-D2,96.0,32.0
3,Darth Vader,202.0,136.0
4,Leia Organa,150.0,49.0
...,...,...,...
82,Rey,,
83,Poe Dameron,,
84,BB8,,
85,Captain Phasma,,


Show columns 0, 3 and 6 of the dataframe

In [None]:
df.iloc[:, [0, 3, 6]]

Unnamed: 0,name,hair_color,birth_year
0,Luke Skywalker,blond,19BBY
1,C-3PO,,112BBY
2,R2-D2,,33BBY
3,Darth Vader,none,41.9BBY
4,Leia Organa,brown,19BBY
...,...,...,...
82,Rey,brown,
83,Poe Dameron,brown,
84,BB8,none,
85,Captain Phasma,,


### Exercise 

Select columns 0 and 5 of the dataframe and display them

In [None]:
#your code here


## To select columns based on name

Select columns by name using `loc`

Show the columns `name` and `hair_color` of the dataframe

In [None]:
df.loc[:, ['name', 'hair_color']]

Unnamed: 0,name,hair_color
0,Luke Skywalker,blond
1,C-3PO,
2,R2-D2,
3,Darth Vader,none
4,Leia Organa,brown
...,...,...
82,Rey,brown
83,Poe Dameron,brown
84,BB8,none
85,Captain Phasma,


Alternatively you can also use the following

In [None]:
df[['name', 'hair_color']]

Unnamed: 0,name,hair_color
0,Luke Skywalker,blond
1,C-3PO,
2,R2-D2,
3,Darth Vader,none
4,Leia Organa,brown
...,...,...
82,Rey,brown
83,Poe Dameron,brown
84,BB8,none
85,Captain Phasma,
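
One detail worth knowing: the type of the result depends on whether you pass a single column name or a list of names. This is a small sketch on an illustrative frame called `sample`.

```python
import pandas as pd

sample = pd.DataFrame({
    'name': ['Luke Skywalker', 'Leia Organa'],
    'hair_color': ['blond', 'brown'],
    'height': [172.0, 150.0],
})

one_column = sample['name']   # a single label returns a Series
as_frame = sample[['name']]   # a list of labels returns a DataFrame

print(type(one_column).__name__, type(as_frame).__name__)
```

This matters later on, because a Series and a DataFrame do not offer exactly the same methods.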


## To create new variables

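Create a new column 'bmi' based on two other columns, 'mass' and 'height':

bmi = mass / height ^ 2 (height in metres)

The code below divides `height` by 100 before squaring, which assumes the column is recorded in centimetres.

Let's first check the data types of the columns: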

In [None]:
df.dtypes

name           object
height        float64
mass          float64
hair_color     object
skin_color     object
eye_color      object
birth_year     object
gender         object
homeworld      object
species        object
dtype: object

In [None]:
df['bmi'] = df['mass'] / (df['height'] / 100) ** 2

In [None]:
df

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species,bmi
0,Luke Skywalker,172.0,77.0,blond,fair,blue,19BBY,male,Tatooine,Human,26.027582
1,C-3PO,167.0,75.0,,gold,yellow,112BBY,,Tatooine,Droid,26.892323
2,R2-D2,96.0,32.0,,"white, blue",red,33BBY,,Naboo,Droid,34.722222
3,Darth Vader,202.0,136.0,none,white,yellow,41.9BBY,male,Tatooine,Human,33.330066
4,Leia Organa,150.0,49.0,brown,light,brown,19BBY,female,Alderaan,Human,21.777778
...,...,...,...,...,...,...,...,...,...,...,...
82,Rey,,,brown,light,hazel,,female,,Human,
83,Poe Dameron,,,brown,light,brown,,male,,Human,
84,BB8,,,none,none,black,,none,,Droid,
85,Captain Phasma,,,,,,,female,,,


Let's make it look a bit better by rounding to two decimal places:

In [None]:
df['bmi'] = df['bmi'].round(2)
df

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species,bmi
0,Luke Skywalker,172.0,77.0,blond,fair,blue,19BBY,male,Tatooine,Human,26.03
1,C-3PO,167.0,75.0,,gold,yellow,112BBY,,Tatooine,Droid,26.89
2,R2-D2,96.0,32.0,,"white, blue",red,33BBY,,Naboo,Droid,34.72
3,Darth Vader,202.0,136.0,none,white,yellow,41.9BBY,male,Tatooine,Human,33.33
4,Leia Organa,150.0,49.0,brown,light,brown,19BBY,female,Alderaan,Human,21.78
...,...,...,...,...,...,...,...,...,...,...,...
82,Rey,,,brown,light,hazel,,female,,Human,
83,Poe Dameron,,,brown,light,brown,,male,,Human,
84,BB8,,,none,none,black,,none,,Droid,
85,Captain Phasma,,,,,,,female,,,


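A BMI of 25 or more is labelled 'overweight' and anything below that 'healthy'. Let's create a new column `bmi_health` to show this.

We can use a lambda function along with `apply()` to perform this transformation. Note that a missing BMI (`NaN`) fails the `>= 25` comparison, so rows without a BMI also end up labelled 'healthy'.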

In [None]:
df['bmi'].apply(lambda value: 'overweight' if value >= 25 else 'healthy')

0     overweight
1     overweight
2     overweight
3     overweight
4        healthy
         ...    
82       healthy
83       healthy
84       healthy
85       healthy
86       healthy
Name: bmi, Length: 87, dtype: object

Let's assign this to the new column now

In [None]:
df['bmi_health'] = df['bmi'].apply(lambda value: 'overweight' if value >= 25 else 'healthy')
df

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species,bmi,bmi_health
0,Luke Skywalker,172.0,77.0,blond,fair,blue,19BBY,male,Tatooine,Human,26.03,overweight
1,C-3PO,167.0,75.0,,gold,yellow,112BBY,,Tatooine,Droid,26.89,overweight
2,R2-D2,96.0,32.0,,"white, blue",red,33BBY,,Naboo,Droid,34.72,overweight
3,Darth Vader,202.0,136.0,none,white,yellow,41.9BBY,male,Tatooine,Human,33.33,overweight
4,Leia Organa,150.0,49.0,brown,light,brown,19BBY,female,Alderaan,Human,21.78,healthy
...,...,...,...,...,...,...,...,...,...,...,...,...
82,Rey,,,brown,light,hazel,,female,,Human,,healthy
83,Poe Dameron,,,brown,light,brown,,male,,Human,,healthy
84,BB8,,,none,none,black,,none,,Droid,,healthy
85,Captain Phasma,,,,,,,female,,,,healthy
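
In the output above, characters such as Rey have no BMI yet are labelled 'healthy', because comparing `NaN` with 25 yields `False`. If you would rather keep the label missing in that case, one possible approach is to blank out the label wherever the BMI is missing. This is a sketch on a small illustrative Series; the names `bmi`, `labels` and `labels_with_gaps` are assumptions for the example.

```python
import numpy as np
import pandas as pd

# BMI values taken from the table above; Rey has no BMI.
bmi = pd.Series([26.03, 21.78, np.nan],
                index=['Luke Skywalker', 'Leia Organa', 'Rey'])

# Same lambda as in the tutorial: NaN >= 25 is False, so Rey becomes 'healthy'.
labels = bmi.apply(lambda value: 'overweight' if value >= 25 else 'healthy')

# where() keeps a label only where the condition holds and
# puts a missing value everywhere else.
labels_with_gaps = labels.where(bmi.notna())

print(labels_with_gaps)
```

Keeping the gap visible avoids counting characters with unknown measurements as healthy in later summaries.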


## Chaining several operations 

You can also chain multiple functions and operations in `pandas` as follows

In [None]:
df['bmi_health_new'] = (
                        (round((df['mass'] / (df['height'] / 100) ** 2), 2))
                        .apply(lambda value: 'overweight' if value >= 25 else 'healthy')
                       )
df[['name', 'bmi', 'bmi_health', 'bmi_health_new']]

Unnamed: 0,name,bmi,bmi_health,bmi_health_new
0,Luke Skywalker,26.03,overweight,overweight
1,C-3PO,26.89,overweight,overweight
2,R2-D2,34.72,overweight,overweight
3,Darth Vader,33.33,overweight,overweight
4,Leia Organa,21.78,healthy,healthy
...,...,...,...,...
82,Rey,,healthy,healthy
83,Poe Dameron,,healthy,healthy
84,BB8,,healthy,healthy
85,Captain Phasma,,healthy,healthy


### Exercise 

Create a new column `height_qual` which should have the value `tall` if `height` is more than or equal to 180 else it should be `not tall`

In [None]:
#your code here


## To sort values in the dataframe

Sort values by descending mass:

In [None]:
df.sort_values(by=['mass'], ascending=False).head(10)

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species,bmi,bmi_health,bmi_health_new
15,Jabba Desilijic Tiure,175.0,1358.0,,"green-tan, brown",orange,600BBY,hermaphrodite,Nal Hutta,Hutt,443.43,overweight,overweight
76,Grievous,216.0,159.0,none,"brown, white","green, yellow",,male,Kalee,Kaleesh,34.08,overweight,overweight
21,IG-88,200.0,140.0,none,metal,red,15BBY,none,,Droid,35.0,overweight,overweight
3,Darth Vader,202.0,136.0,none,white,yellow,41.9BBY,male,Tatooine,Human,33.33,overweight,overweight
77,Tarfful,234.0,136.0,brown,brown,blue,,male,Kashyyyk,Wookiee,24.84,healthy,healthy
5,Owen Lars,178.0,120.0,"brown, grey",light,blue,52BBY,male,Tatooine,Human,37.87,overweight,overweight
22,Bossk,190.0,113.0,none,green,red,53BBY,male,Trandosha,Trandoshan,31.3,overweight,overweight
12,Chewbacca,228.0,112.0,brown,,blue,200BBY,male,Kashyyyk,Wookiee,21.55,healthy,healthy
17,Jek Tono Porkins,180.0,110.0,brown,fair,blue,,male,Bestine IV,Human,33.95,overweight,overweight
67,Dexter Jettster,198.0,102.0,none,brown,yellow,,male,Ojom,Besalisk,26.02,overweight,overweight
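
Several characters have no recorded mass, so it helps to know where they end up after sorting. By default `sort_values` places missing values last, whatever the sort direction; the `na_position` parameter changes that. A minimal sketch on an illustrative frame `sample`:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'name': ['Luke Skywalker', 'R2-D2', 'Rey', 'Darth Vader'],
    'mass': [77.0, 32.0, np.nan, 136.0],
})

# Missing values go to the end in both directions by default.
heaviest_first = sample.sort_values(by='mass', ascending=False)
lightest_first = sample.sort_values(by='mass', ascending=True)

# na_position='first' moves them to the top instead.
missing_first = sample.sort_values(by='mass', na_position='first')

print(heaviest_first['name'].tolist())
print(lightest_first['name'].tolist())
print(missing_first['name'].tolist())
```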


### Exercise 
Show the 10 shortest characters

In [None]:
#your code here


## To summarize data by specific groupings, use groupby() 

Summarise the number of observations and the average mass for each species:

In [None]:
df.groupby(by=['species']).agg({
    'mass': 'mean',
    'species': 'count'
}).rename(columns={
    'mass': 'mean_mass',
    'species': 'count'
})

Unnamed: 0_level_0,mean_mass,count
species,Unnamed: 1_level_1,Unnamed: 2_level_1
Aleena,15.0,1
Besalisk,102.0,1
Cerean,82.0,1
Chagrian,,1
Clawdite,55.0,1
Droid,69.75,5
Dug,40.0,1
Ewok,20.0,1
Geonosian,80.0,1
Gungan,74.0,3


Show the three most common species in our dataset:

In [None]:
result = df.groupby(by=['species']).agg({
    'mass': 'mean',
    'species': 'count'
}).rename(columns={
    'mass': 'mean_mass',
    'species': 'count'
})

result.sort_values(by=['count'], ascending=False).head(3)

Unnamed: 0_level_0,mean_mass,count
species,Unnamed: 1_level_1,Unnamed: 2_level_1
Human,82.781818,35
Droid,69.75,5
Gungan,74.0,3
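
As an alternative to aggregating and then renaming, pandas also supports named aggregation, where each output column is given as `new_name=(column, function)`. The sketch below uses a small illustrative frame `sample` and counts rows through the `name` column, which is a choice made for this example rather than the tutorial's method.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'name': ['Luke Skywalker', 'Leia Organa', 'C-3PO', 'R2-D2', 'BB8'],
    'species': ['Human', 'Human', 'Droid', 'Droid', 'Droid'],
    'mass': [77.0, 49.0, 75.0, 32.0, np.nan],
})

summary = (
    sample.groupby('species')
    # Output column names are set directly, so no rename() is needed.
    .agg(mean_mass=('mass', 'mean'), count=('name', 'count'))
    .sort_values(by='count', ascending=False)
)

print(summary)
```

Note that `mean` skips missing values: BB8 has no mass, so the Droid average here is based on two values only.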


### Exercise
On which planet of origin (homeworld) are the inhabitants tallest on average?

In [None]:
#your code here


## Long chain 

Choose the tall (height > 175) and heavy (mass > 100) characters,
group them by homeworld and species,
and show their average height and mass.

In [None]:
result = df[(df['height'] > 175) & (df['mass'] > 100)]
result

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species,bmi,bmi_health,bmi_health_new
3,Darth Vader,202.0,136.0,none,white,yellow,41.9BBY,male,Tatooine,Human,33.33,overweight,overweight
5,Owen Lars,178.0,120.0,"brown, grey",light,blue,52BBY,male,Tatooine,Human,37.87,overweight,overweight
12,Chewbacca,228.0,112.0,brown,,blue,200BBY,male,Kashyyyk,Wookiee,21.55,healthy,healthy
17,Jek Tono Porkins,180.0,110.0,brown,fair,blue,,male,Bestine IV,Human,33.95,overweight,overweight
21,IG-88,200.0,140.0,none,metal,red,15BBY,none,,Droid,35.0,overweight,overweight
22,Bossk,190.0,113.0,none,green,red,53BBY,male,Trandosha,Trandoshan,31.3,overweight,overweight
67,Dexter Jettster,198.0,102.0,none,brown,yellow,,male,Ojom,Besalisk,26.02,overweight,overweight
76,Grievous,216.0,159.0,none,"brown, white","green, yellow",,male,Kalee,Kaleesh,34.08,overweight,overweight
77,Tarfful,234.0,136.0,brown,brown,blue,,male,Kashyyyk,Wookiee,24.84,healthy,healthy


In [None]:
result.groupby(by=['homeworld', 'species'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9b3cc90460>

In [None]:
result = result.groupby(by=['homeworld', 'species']).agg({
    'height': 'mean',
    'mass': 'mean'
})

result

Unnamed: 0_level_0,Unnamed: 1_level_0,height,mass
homeworld,species,Unnamed: 2_level_1,Unnamed: 3_level_1
Bestine IV,Human,180.0,110.0
Kalee,Kaleesh,216.0,159.0
Kashyyyk,Wookiee,231.0,124.0
Ojom,Besalisk,198.0,102.0
Tatooine,Human,190.0,128.0
Trandosha,Trandoshan,190.0,113.0


In [None]:
result = result.rename(columns={
    'height': 'mean_height',
    'mass': 'mean_mass'
})

result

Unnamed: 0_level_0,Unnamed: 1_level_0,mean_height,mean_mass
homeworld,species,Unnamed: 2_level_1,Unnamed: 3_level_1
Bestine IV,Human,180.0,110.0
Kalee,Kaleesh,216.0,159.0
Kashyyyk,Wookiee,231.0,124.0
Ojom,Besalisk,198.0,102.0
Tatooine,Human,190.0,128.0
Trandosha,Trandoshan,190.0,113.0


Chain all the above together!

In [None]:
(df[(df['height'] > 175) & (df['mass'] > 100)]
   .groupby(by=['homeworld', 'species'])
   .agg({
       'height': 'mean',
       'mass': 'mean'
    })
   .rename(columns={
       'height': 'mean_height',
       'mass': 'mean_mass'
    })
)

Unnamed: 0_level_0,Unnamed: 1_level_0,mean_height,mean_mass
homeworld,species,Unnamed: 2_level_1,Unnamed: 3_level_1
Bestine IV,Human,180.0,110.0
Kalee,Kaleesh,216.0,159.0
Kashyyyk,Wookiee,231.0,124.0
Ojom,Besalisk,198.0,102.0
Tatooine,Human,190.0,128.0
Trandosha,Trandoshan,190.0,113.0
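
After grouping by two columns, `homeworld` and `species` become a two-level index rather than ordinary columns, which is why the header above shows the "Unnamed" placeholders. If you prefer a flat table, one option is to call `reset_index()` at the end of the chain. A sketch using four rows from the filtered result above (`sample` is an illustrative name):

```python
import pandas as pd

sample = pd.DataFrame({
    'name': ['Darth Vader', 'Owen Lars', 'Chewbacca', 'Tarfful'],
    'homeworld': ['Tatooine', 'Tatooine', 'Kashyyyk', 'Kashyyyk'],
    'species': ['Human', 'Human', 'Wookiee', 'Wookiee'],
    'height': [202.0, 178.0, 228.0, 234.0],
    'mass': [136.0, 120.0, 112.0, 136.0],
})

flat = (
    sample.groupby(by=['homeworld', 'species'])
    .agg({'height': 'mean', 'mass': 'mean'})
    .rename(columns={'height': 'mean_height', 'mass': 'mean_mass'})
    .reset_index()  # turn the group keys back into regular columns
)

print(flat)
```

With the keys as regular columns, the result can be filtered and sorted like any other dataframe.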


### Exercise

Choose the characters who have blue eyes,
group them by homeworld and species,
and show their average height and mass.

In [None]:
#your code here


# Plotting your data

To plot your data there are several libraries including `matplotlib` and `seaborn` which you will also be introduced to later on in the course. Here we will use the basic plotting functionality of `pandas` itself.

## Barplot of gender

First let's draw a barplot of 'gender' to see how many characters of each gender there are.


In [None]:
df['gender'].value_counts()

male             62
female           19
none              2
hermaphrodite     1
Name: gender, dtype: int64

In [None]:
df['gender'].value_counts().plot(kind='bar');

From the above plot, you can tell that there are about 60 male and 20 female characters,
as well as 1 hermaphrodite and 2 whose gender is recorded as 'none'.
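
The counts above add up to 84, which matches the 84 non-null `gender` values reported by `info()`: by default `value_counts()` leaves out missing values. The sketch below, on a tiny illustrative Series, shows how to include them with `dropna=False` and how to get shares instead of counts with `normalize=True`.

```python
import numpy as np
import pandas as pd

# Illustrative values only; two entries are missing.
gender = pd.Series(['male', np.nan, np.nan, 'male', 'female'], name='gender')

counts = gender.value_counts()                          # ignores missing values
counts_with_missing = gender.value_counts(dropna=False)  # adds a NaN row
shares = gender.value_counts(normalize=True)             # fractions of non-missing

print(counts_with_missing)
print(shares)
```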

## Scatter plot of mass vs height

Now it's time for some scatter plots. Is the height of characters correlated with their mass?

For a scatter plot we need two variables, x and y, which we pass as column names to `plot.scatter()`.

In [None]:
df.plot.scatter(x='mass', y='height');

It seems that they are correlated, but it is hard to be more precise because we have an outlier!
In this situation we can do two things:

 1) Remove the outlier if it is an error (for example, if it does not represent our population)
 
 2) Investigate further what is so special about this outlier: is it a real character or an error of some sort?

### Without outlier
1) Let's check the plot without the outlier

By calling `plt.xlim()` we set limits on the x axis (which is 'mass' in our case), so the outlier is not shown. We use matplotlib for this.

Note: you will learn how to detect outliers with a quantile function during the course.

In [None]:
import matplotlib.pyplot as plt

df.plot.scatter(x='mass', y='height')
plt.xlim([0, 200]);

Yes, it seems that we have a correlation between mass and height!
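
The visual impression can be backed up with a number. The sketch below computes the Pearson correlation with `Series.corr()` on six rows copied from the dataset, once with and once without the very heavy character. Because it only uses a handful of rows, the values will differ from those for the full dataframe; `sample` and the cut-off of 1000 are choices made for this illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    'name': ['Luke Skywalker', 'C-3PO', 'R2-D2', 'Darth Vader',
             'Leia Organa', 'Jabba Desilijic Tiure'],
    'height': [172.0, 167.0, 96.0, 202.0, 150.0, 175.0],
    'mass': [77.0, 75.0, 32.0, 136.0, 49.0, 1358.0],
})

# Pearson correlation on all six rows, outlier included.
corr_all = sample['mass'].corr(sample['height'])

# Drop the outlier by filtering rows, then recompute.
without_outlier = sample[sample['mass'] < 1000]
corr_trimmed = without_outlier['mass'].corr(without_outlier['height'])

print(round(corr_all, 2), round(corr_trimmed, 2))
```

Unlike `plt.xlim()`, which only hides the point in the plot, filtering the rows removes the outlier from the calculation itself.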

### With outlier

2) Let's take a closer look at the outlier. Who might it be?


In [None]:
outlier = df[df['mass'] > 1000]
outlier

Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,species,bmi,bmi_health,bmi_health_new
15,Jabba Desilijic Tiure,175.0,1358.0,,"green-tan, brown",orange,600BBY,hermaphrodite,Nal Hutta,Hutt,443.43,overweight,overweight


In [None]:
outlier_mass = outlier['mass'].values[0]
outlier_height = outlier['height'].values[0]
outlier_name = outlier['name'].values[0]

In [None]:
fig, ax = plt.subplots()
df.plot.scatter(x='mass', y='height', ax=ax)
ax.text(outlier_mass, outlier_height, outlier_name);

Well, it is no surprise that Jabba is the heaviest character in the Star Wars world.

### Exercise

Visualize a Scatterplot of mass vs BMI

In [None]:
#your code here


### Exercise

Visualize a Scatterplot of mass vs BMI with the outlier removed

In [None]:
#your code here


## Histogram

A histogram is very useful for visualizing the distribution of continuous data based on raw frequencies.

In [None]:
df['height'].plot(kind='hist');
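
To see what a histogram actually computes, you can reproduce the binning without plotting. The sketch below uses `np.histogram` on five heights from the dataset (plus one missing value) with three equal-width bins; the bin count is chosen for illustration, while the pandas plot uses 10 bins unless you pass `bins=`.

```python
import numpy as np
import pandas as pd

height = pd.Series([172.0, 167.0, 96.0, 202.0, 150.0, np.nan])

# Missing values cannot be binned, so drop them first.
counts, bin_edges = np.histogram(height.dropna(), bins=3)

print(counts)      # number of values falling into each bin
print(bin_edges)   # the four edges that delimit the three bins
```

Each bar in the plotted histogram corresponds to one entry of `counts`, drawn between two neighbouring entries of `bin_edges`.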

## Density

Density plots show a smoothed estimate of the distribution of a column. pandas computes it with a kernel density estimate, which requires SciPy to be installed.

In [None]:
df['height'].plot(kind='density');

### Exercise 

Visualize a histogram and density plot for the `mass` column

In [None]:
#your code here


In [None]:
#your code here


## Boxplot

The boxplot compactly displays the distribution of a continuous variable. It visualises five summary statistics (the median, two hinges and two whiskers), and all "outlying" points individually.


In [None]:
df['height'].plot(kind='box');
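
The numbers behind a boxplot can be computed directly with `quantile()`. The sketch below uses seven heights from the dataset and the common 1.5 × IQR rule for the whisker limits, which is also the matplotlib default; the variable names are illustrative.

```python
import pandas as pd

height = pd.Series([172.0, 167.0, 96.0, 202.0, 150.0, 178.0, 165.0])

# Lower hinge, median and upper hinge (25th, 50th and 75th percentiles).
q1, median, q3 = height.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range: the height of the box

# Points beyond these limits are drawn individually as outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = height[(height < lower_limit) | (height > upper_limit)]

print(q1, median, q3)
print(outliers.tolist())
```

The whiskers themselves end at the most extreme data points that still lie within these limits, not at the limits themselves.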

### Exercise 

Plot the boxplot for `mass`

In [None]:
#your code here


Filter the dataframe to exclude rows with `mass` > 400 and plot a boxplot for `mass` again

In [None]:
#your code here


# Congratulations!

You've finished all the steps in our tutorial and are now one step closer to mastering the basics of data analysis with Python and `pandas`.

We learned how to summarize data, how to subset and slice it by rows and by columns, how to create new variables, and how to plot it. The commands we wrote resemble the language we normally use when talking about data. With a little practice you'll soon be able to draw meaningful conclusions for making decisions.

Good luck with the rest of the exercises. We look forward to discussing the beauty of statistics with you during the course!