Let's import the usual libraries:

In [None]:
%matplotlib inline 
import matplotlib.pyplot as plt

In [None]:
import pandas as pd
import numpy as np

## Generate data

In [None]:
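# 1000 random numbers drawn from a standard normal distribution
df = pd.DataFrame({'Data': np.random.normal(size=1000)})
# add 10 outliers centred around 1000
outliers = pd.DataFrame({'Data': 1000+np.random.normal(size=10)})
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, outliers], ignore_index=True)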
df.head()

In [None]:
df.tail()

In [None]:
# get statistics
df.describe()

In [None]:
# histogram
df.hist()

## Boxplots

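A different way of visualizing the distribution of data is the boxplot, which displays common quantiles. The box spans the first to the third quartile, with a line at the median. With the matplotlib defaults that pandas uses, the whiskers extend to the most extreme data points within 1.5 times the interquartile range (IQR) of the box, and points beyond the whiskers are drawn individually.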

In [None]:
#box plot
df.plot(kind='box')
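
The numbers behind a boxplot can also be computed by hand. The following is a minimal sketch, assuming the common 1.5 × IQR whisker rule and a synthetic sample built the same way as `df`; the seed and the names `lower_fence`, `upper_fence` and `n_fliers` are illustrative choices, not part of the notebook.

```python
import numpy as np

# Synthetic sample mirroring the notebook: 1000 normal values plus 10 outliers.
rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
sample = np.concatenate([rng.normal(size=1000), 1000 + rng.normal(size=10)])

# The box: first quartile, median, third quartile.
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Assumed whisker rule: points further than 1.5 * IQR from the box are fliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
n_fliers = int(np.sum((sample < lower_fence) | (sample > upper_fence)))

print('Q1 =', q1, 'median =', median, 'Q3 =', q3)
print('fences =', lower_fence, upper_fence)
print('points outside the fences =', n_fliers)
```

Because quartiles barely move when a few extreme values are added, the fences stay close to the bulk of the data and the 10 injected values end up outside them.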

## Some analysis with outliers

In [None]:
# calculate the average 
df['Data'].mean()

In [None]:
# calculate the standard deviation
df['Data'].std()
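
The mean and standard deviation computed above are both pulled strongly by the 10 outliers, whereas the median is not. The sketch below illustrates this on freshly generated data with the same shape as `df`; the seed and the variable names `base`, `contaminated` and `summary` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility
base = pd.Series(rng.normal(size=1000))
contaminated = pd.concat(
    [base, pd.Series(1000 + rng.normal(size=10))], ignore_index=True
)

# Side-by-side statistics with and without the injected outliers.
summary = pd.DataFrame(
    {
        'without outliers': [base.mean(), base.median(), base.std()],
        'with outliers': [contaminated.mean(), contaminated.median(), contaminated.std()],
    },
    index=['mean', 'median', 'std'],
)
print(summary.round(3))
```

Ten values near 1000 among 1010 rows shift the mean by roughly 10 and inflate the standard deviation by about two orders of magnitude, while the median stays close to 0.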

## Eliminate outliers

In [None]:
# calculate the median (the 0.5 quantile)
df.quantile(.5)

In [None]:
# keep only the values below the 99th percentile
quantile_99 = df.quantile(.99).Data
clean_df = df[df.Data < quantile_99]
clean_df.plot(kind='box')
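
A quantile cut always removes a fixed share of the rows, whether or not they are genuine outliers. The sketch below counts what a 99th-percentile cut removes from synthetic data shaped like `df`; the threshold of 500 used to tell the two groups apart is an assumption that only works because the injected outliers sit near 1000.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # seed chosen arbitrarily for reproducibility
data = pd.Series(np.concatenate([rng.normal(size=1000), 1000 + rng.normal(size=10)]))

cutoff = data.quantile(0.99)
kept = data[data < cutoff]
removed = data[data >= cutoff]

# Split the removed rows into injected outliers and ordinary values.
n_removed_outliers = int((removed > 500).sum())
n_removed_regular = int((removed <= 500).sum())

print('cutoff =', cutoff)
print('rows kept =', len(kept))
print('removed outliers =', n_removed_outliers)
print('removed regular values =', n_removed_regular)
```

Here the outliers make up slightly less than 1% of the rows (10 of 1010), so the cut removes all of them plus one ordinary value. With fewer outliers the same cut would discard more ordinary data; with more outliers some of them would survive.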

In [None]:
# keep only the rows whose 'Data' value lies within 3 standard deviations of the mean
clean_df = df[np.abs(df.Data - df.Data.mean()) <= 3 * df.Data.std()]
clean_df.plot(kind='box')
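
The same 3-standard-deviation filter can be wrapped in a small reusable function. This is a sketch only: the helper name `drop_beyond_n_sigma` and its parameters are hypothetical, and the data is regenerated here so the block runs on its own.

```python
import numpy as np
import pandas as pd


def drop_beyond_n_sigma(frame, column, n_sigma=3.0):
    """Return the rows whose value in `column` lies within n_sigma standard
    deviations of the column mean (hypothetical helper, not a pandas function)."""
    values = frame[column]
    distance = (values - values.mean()).abs()
    return frame[distance <= n_sigma * values.std()]


rng = np.random.default_rng(7)  # seed chosen arbitrarily for reproducibility
demo = pd.DataFrame(
    {'Data': np.concatenate([rng.normal(size=1000), 1000 + rng.normal(size=10)])}
)

clean_demo = drop_beyond_n_sigma(demo, 'Data')
print('rows before =', len(demo), 'rows after =', len(clean_demo))
print('mean =', clean_demo['Data'].mean(), 'std =', clean_demo['Data'].std())
```

Note that the mean and standard deviation used for the cut are computed on the contaminated data, so the accepted band is much wider than it would be for clean data. It works here because the outliers are far away; with outliers closer to the bulk, or with many more of them, some could fall inside the band and survive the filter.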

In [None]:
print('Average = ', clean_df['Data'].mean())
print('Std = ', clean_df['Data'].std())

## Titanic data

For this series of examples, let's load a dataset with the passengers of the Titanic:

In [None]:
titanic = pd.read_excel("data/titanic.xls", sheet_name="titanic")
titanic.head()

In [None]:
titanic.describe()

## Boxplots

The same kind of plot works on real data. Here the fares are grouped by passenger class, with one box per class:

In [None]:
titanic.boxplot(column='fare', by='pclass', grid=False, figsize = (8,5))

In [None]:
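# overlay the fare distributions of survivors and non-survivors
titanic[titanic['survived']==1]["fare"].hist(alpha=0.5, label='survived')
titanic[titanic['survived']==0]["fare"].hist(alpha=0.5, label='did not survive')
plt.legend()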


You can think of the box plot as viewing the distribution from above. The points drawn individually beyond the whiskers are "outlier" points: with the default settings, values more than 1.5 times the interquartile range away from the box.

In [None]:
# consider class 1
titanic[titanic['pclass']==1].boxplot(column='fare', grid=False, figsize = (8,5))

In [None]:
# calculate the average
print('Average value = ', titanic[titanic['pclass']==1]['fare'].mean())

In [None]:
# take a look at the outliers
titanic[(titanic['pclass']==1)&(titanic['fare']>400)]

In [None]:
# calculate the average without outliers
print('Average value = ', titanic[(titanic['pclass']==1)&(titanic['fare']<400)]['fare'].mean())
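
The threshold of 400 above was read off the plot. A data-driven alternative is to derive a separate cut-off for each class from its own quartiles. The sketch below assumes the 1.5 × IQR rule and runs on a small toy table whose values are invented for illustration; only the column names `pclass` and `fare` are taken from the dataset, and `upper_fence` and `is_outlier` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the Titanic table: these fares are made up, not real data.
toy = pd.DataFrame({
    'pclass': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    'fare': [30.0, 50.0, 60.0, 80.0, 90.0, 500.0, 10.0, 12.0, 13.0, 15.0],
})


def upper_fence(fares):
    """Upper whisker limit under the assumed rule: Q3 + 1.5 * IQR."""
    q1, q3 = fares.quantile([0.25, 0.75])
    return q3 + 1.5 * (q3 - q1)


# One fence per class, broadcast back to every row of that class.
fence = toy.groupby('pclass')['fare'].transform(upper_fence)
toy['is_outlier'] = toy['fare'] > fence

# Average fare per class with and without the flagged rows.
mean_all = toy.groupby('pclass')['fare'].mean()
mean_clean = toy[~toy['is_outlier']].groupby('pclass')['fare'].mean()
print(toy)
print(mean_all)
print(mean_clean)
```

On the real data the same pattern would apply to `titanic` directly, provided rows with missing fares are handled first.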