# Data Exploration in Python

Rafael Martínez (rmartinez@gradiant.org)

----------------




## Table of contents


1. [Data wrangling](#wrangling)
    - [Data import](#import)
    - [Data structure](#structure)
2. [Exploratory Data Analysis](#EDA)
    - [Dealing with missing values](#NaN)
    - [Exploring the variation of my variables](#variation)
    - [Exploring the covariation between my variables](#covariation)
    - [Data visualization](#plots)
   

Before starting, we need to load the required modules: `pandas`, `numpy`, `matplotlib` and `seaborn`.

In [None]:
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set()
sns.set_context('talk')

<a id='wrangling'></a>

# Data wrangling

This is the art of getting your data into Python in a useful form for visualisation and modelling. Data wrangling is very important: without it you can’t work with your own data!

This phase is divided into data import and the idea of tidy data (how you can organise your data in Python).

<a id='import'></a>

## Data import 

In this part, we are going to learn how to get your data from disk into Python. There are several functions to load data in Python, and you should choose the one that suits the data format. For tabular data (e.g. CSV), a pandas DataFrame is often the best choice.

In [None]:
# http://data.insideairbnb.com/spain/catalonia/barcelona/2018-09-11/visualisations/listings.csv

data = pd.read_csv('airbnb.csv')
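`read_csv()` accepts many optional parameters. The sketch below illustrates two of them, `usecols` and `na_values`, on a small in-memory CSV. The rows and the `unknown` marker are hypothetical and only stand in for a real file.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
csv_text = """id,name,price,room_type
1,Flat A,50,Private room
2,Flat B,unknown,Entire home/apt
3,Flat C,120,Private room
"""

listings = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "price", "room_type"],  # load only the columns we need
    na_values=["unknown"],                 # treat this marker as a missing value
)
print(listings.dtypes)
```

Because the `unknown` entry is read as `NaN`, the `price` column is parsed as a numeric column rather than as strings.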

<a id='structure'></a>

## Data structure

Now we will have a look at the data, their size and their structure.

In [None]:
data.shape

In [None]:
data.columns

In [None]:
data.info()

In [None]:
data.head()

In [None]:
type(data)

Next, you can see some examples of how to select data from a dataframe:

In [None]:
data.iloc[0] # select first row

In [None]:
data.iloc[0:3] # select first three rows

In [None]:
data.iloc[-3:] # select 3 last rows

In [None]:
data['id'] # select first column (equivalent to data[data.columns[0]])

In [None]:
data[data.columns[1:5]] # select columns 2 to 5

In [None]:
data.id[0:10] # select column by name (another way)

In [None]:
data.head(6) # select first N rows

In [None]:
data.tail(3) # select last N rows

In [None]:
data.host_name.iloc[0] # first element of a column (a pandas Series)

In [None]:
data.host_name.iloc[-1] # last element of a column (a pandas Series)
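The cells above select by position with `iloc`. As a complement, here is a minimal sketch of label-based selection with `loc`, which also accepts a boolean mask. The small table and its index labels are hypothetical.

```python
import pandas as pd

# Hypothetical miniature table; the column names mirror the listings data.
df = pd.DataFrame(
    {"id": [10, 20, 30], "host_name": ["Ana", "Bo", "Cy"], "price": [50, 80, 120]},
    index=["a", "b", "c"],
)

by_position = df.iloc[0:2]    # rows at positions 0 and 1 (end excluded)
by_label = df.loc["a":"b"]    # rows labelled "a" to "b" (end included)
cell = df.loc["b", "price"]   # single value by row label and column name
cheap = df.loc[df["price"] < 100, ["id", "price"]]  # boolean mask plus column list
```

Note the difference in slicing: `iloc` excludes the end position, while `loc` includes the end label.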

<a id='EDA'></a>

# Exploratory Data Analysis


The goal during Exploratory Data Analysis (EDA) is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.

There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:

* What type of variation occurs within my variables?

* What type of covariation occurs between my variables?






**Some remarks**:
* A *variable* is a quantity, quality, or property that you can measure.

* A *value* is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.

* An *observation* is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a data point.



Often you’ll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with. The sections below show some examples.

<a id='NaN'></a>

## Dealing with missing values

Before starting the analysis, we need to deal with missing values, a common task in data analysis. In Python, missing values are often represented by `NaN`.

The first step is to identify these values. To this end, you can use `isnull()` or its alias `isna()`. These methods return a boolean object of the same shape as the input, with `True` in the positions that contain missing values. Summing the result counts the missing values in each column.

In [None]:
data.isnull().sum()
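Since the boolean result behaves like 0/1 values, it can also give the share of missing values per column or the number of complete rows. A minimal sketch on a hypothetical two-column table:

```python
import numpy as np
import pandas as pd

# Hypothetical table with missing entries in both columns.
df = pd.DataFrame(
    {"price": [50.0, np.nan, 120.0, 80.0], "name": ["A", None, "C", None]}
)

missing_count = df.isna().sum()                # missing values per column
missing_share = df.isna().mean()               # fraction of missing values per column
complete_rows = df.notna().all(axis=1).sum()   # rows without any missing value
```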

A very useful method to see summary statistics of the numerical columns (including `count`, the number of non-missing values) is `describe()`, called on the dataframe.

In [None]:
data.describe()

Now, we can delete these observations or we can recode them.

To recode missing values (or specific indicators that stand for missing values), we can use ordinary subsetting and assignment. For example, we can replace the missing values in a Series `x` with the mean of `x` by first identifying the `NaN` entries and then assigning them a value. Another option, used in the cells below, is to interpolate them from the neighbouring values.

In [None]:
x = pd.Series([1, 2, 3, 4, np.nan, 6, 7, np.nan])

In [None]:
x.interpolate() # interpolate values, default method='linear'
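For comparison with interpolation, the mean-based recoding described earlier can be sketched as follows. Both the explicit subsetting version and the `fillna()` shortcut are shown; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 2, 3, 4, np.nan, 6, 7, np.nan])

is_missing = x.isna()   # boolean mask marking the NaN positions
x_mean = x.mean()       # mean() skips NaN by default

recoded = x.copy()
recoded[is_missing] = x_mean   # subset, then assign

filled = x.fillna(x_mean)      # same result in one call
```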

In our analysis, we are going to exclude these cases directly. To this end, we can use the `dropna()` method.

In [None]:
x.dropna()

In [None]:
data = data.dropna() # Warning! here we change our dataset
data.shape
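Calling `dropna()` without arguments removes every row that has at least one missing value, which can discard many observations. A less drastic alternative is to require completeness only in the columns you need, via the `subset` parameter. The table below is hypothetical and is only meant to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical table: each of the first two rows has some missing value.
df = pd.DataFrame(
    {
        "price": [50.0, np.nan, 120.0],
        "last_review": [None, "2018-08-01", "2018-09-01"],
        "reviews_per_month": [np.nan, 0.5, 1.2],
    }
)

all_complete = df.dropna()                  # keep rows with no missing value at all
price_known = df.dropna(subset=["price"])   # only require a known price
```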

<a id='variation'></a>

## Exploring the variation of my variables

At this stage, we'll see some examples of how to analyze a single variable, paying attention to its type (numerical or categorical). For a numerical variable, it is common to use the `describe()` method; for a categorical variable, we can obtain the frequency of each level of the variable.

In [None]:
data.price.describe() # numerical

In [None]:
data.room_type.describe() # categorical

In [None]:
data.room_type.value_counts() # absolute frequencies

In [None]:
data.room_type.value_counts()/data.room_type.count() # relative frequencies
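The relative frequencies can also be obtained directly with the `normalize` parameter of `value_counts()`. A small sketch on a hypothetical column, checking that both approaches agree:

```python
import pandas as pd

# Hypothetical categorical column.
room_type = pd.Series(
    ["Private room", "Entire home/apt", "Private room", "Shared room"]
)

manual = room_type.value_counts() / room_type.count()   # counts divided by total
builtin = room_type.value_counts(normalize=True)        # proportions directly
```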

<a id='covariation'></a>

## Exploring the covariation between my variables

Now, we'll try to understand how some variables relate to each other and obtain results for combinations of them.

In [None]:
data[(data['neighbourhood_group'] == 'Ciutat Vella') & (data['room_type'] == 'Private room')]

In [None]:
# Other examples:
# data[(data['neighbourhood_group'] == 'Ciutat Vella') | (data['neighbourhood_group'] == 'Gràcia')]
# data[data['neighbourhood_group'].isin(['Ciutat Vella', 'Gràcia'])]

The method `sort_values()` takes one or more column names to order the dataframe by. If you provide more than one column name, each additional column is used to break ties in the values of the preceding columns:

In [None]:
data.sort_values('number_of_reviews', ascending=False)
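To illustrate the tie-breaking behaviour, here is a minimal sketch with two sort columns and a different direction for each one. The rows are hypothetical.

```python
import pandas as pd

# Hypothetical listings; B and C are tied on number_of_reviews.
df = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "number_of_reviews": [10, 25, 25, 3],
        "price": [60, 90, 40, 70],
    }
)

# Most reviewed first; among ties, cheapest first.
ranked = df.sort_values(["number_of_reviews", "price"], ascending=[False, True])
print(ranked)
```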

Besides selecting sets of existing columns, sometimes it's very useful to add new columns that are obtained from the existing ones.

In [None]:
data['ri_price'] = (data['price'] > np.percentile(data['price'], 25)) & (data['price'] < np.percentile(data['price'], 75))
data.head()
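The new column above flags prices strictly between the first and third quartiles. An equivalent sketch using the pandas method `quantile()` instead of `np.percentile`, on a hypothetical price column, with `assign()` returning a new dataframe rather than modifying the original:

```python
import pandas as pd

# Hypothetical prices.
df = pd.DataFrame({"price": [10, 20, 30, 40, 50, 60, 70, 80, 90]})

q1, q3 = df["price"].quantile([0.25, 0.75])   # first and third quartiles

# assign() returns a copy with the extra column; df itself is unchanged.
flagged = df.assign(ri_price=(df["price"] > q1) & (df["price"] < q3))
```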

To quickly focus on a specific subset of columns:

In [None]:
data[['name', 'price']].head()

We can also answer questions such as: which distinct values of `neighbourhood` are there in `Ciutat Vella`?

In [None]:
pd.unique(data[data['neighbourhood_group'] == 'Ciutat Vella']['neighbourhood']).tolist()

In [None]:
pd.unique(data['neighbourhood']).tolist() # list all neighbourhoods
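If only the number of distinct values is needed, `nunique()` returns it directly, and it combines with `groupby()` to count per group. A sketch on a few hypothetical rows:

```python
import pandas as pd

# Hypothetical rows; the column names mirror the listings data.
df = pd.DataFrame(
    {
        "neighbourhood_group": ["Ciutat Vella", "Ciutat Vella", "Ciutat Vella", "Gràcia"],
        "neighbourhood": ["el Raval", "el Raval", "el Barri Gòtic", "la Vila de Gràcia"],
    }
)

in_cv = df.loc[df["neighbourhood_group"] == "Ciutat Vella", "neighbourhood"]
n_distinct = in_cv.nunique()   # number of distinct neighbourhoods in one group

# Same count for every group at once.
per_group = df.groupby("neighbourhood_group")["neighbourhood"].nunique()
```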

Finally, `groupby()` is one of the tools that you'll use most in data exploration:

In [None]:
data.groupby(['neighbourhood_group'])['price'].mean()

In [None]:
data.groupby(['room_type'])['price'].describe()

In [None]:
data.groupby(['neighbourhood_group', 'room_type'])['price'].describe()
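When only a few statistics are of interest, `agg()` lets you choose them instead of taking the full `describe()` output. A minimal sketch on hypothetical prices:

```python
import pandas as pd

# Hypothetical prices per room type.
df = pd.DataFrame(
    {
        "room_type": ["Private room", "Private room",
                      "Entire home/apt", "Entire home/apt", "Entire home/apt"],
        "price": [40, 60, 100, 150, 500],
    }
)

# One row per room type, one column per requested statistic.
summary = df.groupby("room_type")["price"].agg(["count", "mean", "median"])
print(summary)
```

Comparing the mean and the median per group is a quick way to spot skewed prices: here the single expensive listing pulls the mean of its group well above the median.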

<a id='plots'></a>

## Data visualization

Finally we will see how to visualise your data using `matplotlib` and `seaborn`.
 - `matplotlib` is the most used module
 - `seaborn` provides more powerful and elegant visualizations


### Univariate analysis

The way you visualize the distribution of a variable depends on the type of variable you have: categorical or numerical. A variable is **categorical** if it can only take one of a small set of values. In pandas, categorical variables are usually stored as strings (columns of `object` dtype). To examine the distribution of a categorical variable, use a bar chart:

In [None]:
fig = plt.figure(figsize = (16, 8))
ng = data.groupby(['neighbourhood_group'])['id'].count()
plt.bar(ng.keys(), ng) # bar plot with matplotlib

In [None]:
fig = plt.figure(figsize = (16, 8))
sns.countplot(x="neighbourhood_group", data=data) # countplot with seaborn

In [None]:
fig = plt.figure(figsize = (16, 8))
sns.countplot(x='neighbourhood_group', hue='room_type', data=data) # countplot with seaborn

A variable is **numerical** if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of numerical variables. In order to see the distribution of a numerical variable, you can use a histogram or a density plot:

In [None]:
fig = plt.figure(figsize = (16, 8))
plt.hist(data['price'], 30) # histogram with matplotlib
plt.xlabel('price')

# it seems that there are some outliers

In [None]:
fig = plt.figure(figsize = (16, 8))
sns.histplot(data['price'], kde=False) # histogram with seaborn (histplot replaces the deprecated distplot; needs seaborn >= 0.11)

In [None]:
data_filtered = data[data['price'] < 200] # Let's zoom in the (0, 200) interval

In [None]:
fig = plt.figure(figsize = (16, 8))
plt.hist(data_filtered['price'], 30)
plt.xlabel('price')

In [None]:
fig = plt.figure(figsize = (16, 8))
sns.histplot(data_filtered['price'], kde=False) # histplot replaces the deprecated distplot (seaborn >= 0.11)

In [None]:
fig = plt.figure(figsize = (16, 8))
plt.boxplot(data_filtered['price']) # boxplot with matplotlib
plt.ylabel('price')

In [None]:
fig = plt.figure(figsize = (16, 8))
sns.boxplot(x=data_filtered['price']) # boxplot with seaborn
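The cut-off of 200 used above was chosen by looking at the histogram. One common alternative, sketched here as an assumption rather than as the method of this notebook, is the 1.5 × IQR rule, which matches the default whisker length of a box plot. The prices are hypothetical.

```python
import pandas as pd

# Hypothetical prices with one extreme value.
price = pd.Series([40, 55, 60, 65, 70, 80, 95, 900])

q1, q3 = price.quantile(0.25), price.quantile(0.75)
iqr = q3 - q1                                   # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # fences of the 1.5 x IQR rule

outliers = price[(price < lower) | (price > upper)]
```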

### Multivariate analysis

From now on, we are going to describe the joint behaviour of variables ("covariation"), that is, the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables. How you do that depends again on the type of variables involved.

If you want to explore the distribution of a **numerical** variable broken down by a **categorical** variable, you can use some of these plots:

In [None]:
fig = plt.figure(figsize = (16, 10))
sns.violinplot(x='price', y='room_type', data=data_filtered, split=True)

In [None]:
fig = plt.figure(figsize = (16, 10))
sns.boxplot(x='neighbourhood', y='price', data=data_filtered[data_filtered['neighbourhood_group'] == "Gràcia"])

In [None]:
fig = plt.figure(figsize = (16, 10))
sns.boxplot(x='neighbourhood_group', y='price', hue='room_type', data=data_filtered)

To visualise the relation between **categorical variables**, you’ll need to count the number of observations for each combination and plot the counts, for instance with `heatmap()` from `seaborn`. For two **numerical** variables, `jointplot()` and `lmplot()` draw a scatter plot, the latter with a fitted regression line.

In [None]:
fig = plt.figure(figsize = (16, 10))
gb = data_filtered.groupby(['neighbourhood_group', 'room_type']).count()['id'].unstack(level=-1)
sns.heatmap(gb, cmap='coolwarm')
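The table of counts passed to `heatmap()` can also be built with `pd.crosstab()`. The sketch below compares both approaches on hypothetical rows. The difference is that combinations without observations are `NaN` after `groupby()` plus `unstack()`, and `0` with `crosstab()`.

```python
import pandas as pd

# Hypothetical rows; no "Entire home/apt" listing in Eixample.
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "neighbourhood_group": ["Gràcia", "Gràcia", "Eixample", "Eixample"],
        "room_type": ["Private room", "Entire home/apt", "Private room", "Private room"],
    }
)

via_groupby = (
    df.groupby(["neighbourhood_group", "room_type"])["id"].count().unstack(level=-1)
)
via_crosstab = pd.crosstab(df["neighbourhood_group"], df["room_type"])
```

Which one to prefer depends on how empty combinations should be handled: as missing cells or as zero counts.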

In [None]:
fig = plt.figure(figsize = (16, 28))
gb = data_filtered.groupby(['neighbourhood', 'neighbourhood_group']).count()['id'].unstack(level=-1)
sns.heatmap(gb, cmap='coolwarm')

In [None]:
# jointplot is a figure-level function: it creates its own figure, sized with `height`
sns.jointplot(x='price', y='number_of_reviews', data=data_filtered, height=16) # jointplot with seaborn

Maybe you can see a pattern in the points.

In [None]:
# lmplot is also figure-level, so no plt.figure() call is needed
sns.lmplot(x='price', y='number_of_reviews', hue='room_type', data=data_filtered, height=16)
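To complement the visual impression, the strength of a linear relation between two numerical variables can be summarised with the Pearson correlation coefficient. The values below are hypothetical and do not come from the listings data.

```python
import pandas as pd

# Hypothetical values: reviews decrease as price increases.
df = pd.DataFrame(
    {
        "price": [30, 50, 70, 90, 110],
        "number_of_reviews": [40, 35, 20, 12, 5],
    }
)

r = df["price"].corr(df["number_of_reviews"])   # Pearson correlation by default
corr_matrix = df.corr()                         # pairwise correlations of all columns
print(round(r, 3))
```

A coefficient close to -1 or 1 indicates a strong linear relation, and a value near 0 indicates little or no linear relation.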