# Data manipulation in Pandas

In this notebook we'll introduce some of the basic concepts of the Pandas library for data manipulation. We'll also see simple chart examples using both Pandas and Seaborn.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [33]:
# The most important Pandas objects are the `Series` and the `DataFrame`
s = pd.Series({'x': 1, 'y': 2, 'z': 3})
s

x    1
y    2
z    3
dtype: int64

In [32]:
# We can slice a series in a similar way to a Python list
s[:2]

x    1
y    2
dtype: int64

In [35]:
# We can also select using the index
s['y']

2

In [36]:
s[['x', 'z']]

x    1
z    3
dtype: int64
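
The selections above can be written more explicitly with `.iloc` (position) and `.loc` (label). The sketch below rebuilds the same small Series, so it can be run on its own. Note that a label slice includes its end point, whereas a positional slice does not.

```python
import pandas as pd

s = pd.Series({'x': 1, 'y': 2, 'z': 3})

# Positional slice: the end point is excluded, as with a Python list
first_two = s.iloc[:2]

# Label slice: the end point IS included
x_to_y = s.loc['x':'y']

# Boolean mask: keep only the values greater than 1
greater_than_one = s[s > 1]

print(first_two, x_to_y, greater_than_one, sep='\n\n')
```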

A DataFrame is like a SAS dataset or an R data frame (or tibble). Each column of a DataFrame is a Series.

In [46]:
df = pd.DataFrame({'x': np.random.randn(100), 
                   'y': np.random.randint(0, 100, 100),
                   'z': np.random.choice(list('abcde'), 100)})

In [42]:
df.head()

,x,y,z
0,-0.548199,8,c
1,-1.240177,40,c
2,2.248287,49,e
3,0.43641,36,c
4,-0.940199,93,a


In [43]:
df.dtypes

x    float64
y      int32
z     object
dtype: object

We are only going to scratch the surface of selecting from Series and DataFrames. For more information, see the Pandas documentation or (better) the [Python Data Science Handbook, chapter 3](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html).

In [47]:
# Let's set an index on the DataFrame:
df.set_index('z', inplace=True)
df.head()

z,x,y
b,0.290993,90
a,-0.914649,21
a,-0.221109,81
b,0.529836,76
e,1.03726,20


In [52]:
# Now we can select all records where the index = 'c':
df.loc['c'].head()

z,x,y
c,-0.449746,83
c,-0.632314,91
c,-0.720477,85
c,-0.398999,97
c,0.911434,71


In [54]:
# .loc looks at the "explicit" index - i.e. the one we've defined. In contrast, .iloc looks at the "implicit" 
# index, which is just the row number. So to select rows 5 to 9:
df.iloc[5:10]

z,x,y
d,-0.460142,98
c,-0.449746,83
c,-0.632314,91
e,-0.800619,72
d,0.021118,10


You may see some examples on the web that use the `.ix` indexer. Don't do this: it was deprecated in pandas 0.20 and removed in pandas 1.0. Use `.loc` or `.iloc` instead.

We'll see some more indexing examples below.

### Read our dataset - the Tableau superstore data!

In [None]:
try:
    superstore = pd.read_excel('data/superstore.xlsx')
except FileNotFoundError:
    superstore = pd.read_excel('https://query.data.world/s/n2pyux2nabxy4c43zl3uugxsk5gt6v')

In [None]:
# Quick check: do we have the right number of rows?
assert len(superstore) == 51290

In [None]:
type(superstore)

In [None]:
superstore.head()

### A few basic data exploration tasks

In [None]:
# Basic summary of the table
superstore.info()

In [None]:
# How many countries do we have?
superstore['Country'].nunique()

In [None]:
# List of countries
superstore['Country'].value_counts()

In [None]:
# Total sales
superstore['Sales'].sum()

In [None]:
# Number of unique values for each column
superstore.nunique()

### Quick look at the distributions of numeric variables

In [None]:
plotdata = superstore.select_dtypes('number').drop('Postal Code', axis=1)
plotdata.dtypes

In [None]:
g = sns.FacetGrid(plotdata.melt(), col='variable', col_wrap=3, sharey=False, sharex=False)
# sns.distplot is deprecated in recent Seaborn versions; histplot draws the same histogram
g.map(sns.histplot, 'value')
plt.show()
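
The `melt()` call is what makes the facet grid work: it reshapes the table from "wide" (one column per variable) to "long" (one row per value), with the default column names `variable` and `value`. A minimal sketch on a made-up two-column frame:

```python
import pandas as pd

# Hypothetical wide table: one column per numeric variable
wide = pd.DataFrame({'Sales': [100.0, 250.0], 'Profit': [20.0, -5.0]})

# Long table: one row per (variable, value) pair
long = wide.melt()
print(long)
```

`FacetGrid` then draws one panel per distinct entry in the `variable` column.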

### Aggregations: [Split, apply, combine](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html#GroupBy:-Split,-Apply,-Combine)

In [None]:
# Sales by segment
superstore.groupby('Segment')['Sales'].sum()

In [None]:
# Sales by market and segment
superstore.groupby(['Market', 'Segment'])['Sales'].sum()

In [None]:
# We can store the results of a query in an object:
sales_summary = superstore.groupby(['Market', 'Segment'])['Sales'].sum()
type(sales_summary)

In [None]:
# We now have an example of a 'multi-index' (or hierarchical index)
sales_summary.index

In [None]:
# What were the sales figures for Asia Pacific?
sales_summary['Asia Pacific']

In [None]:
# What were the sales for Consumer and Corporate segments in Europe?
sales_summary.loc[('Europe', ['Consumer', 'Corporate'])]
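
A few other ways of working with a multi-index, sketched on a small made-up table so that it runs without the superstore file (the figures are invented). `xs` selects on an inner level, and `unstack` turns one index level into columns.

```python
import pandas as pd

# Hypothetical data with the same column names as the superstore table
toy = pd.DataFrame({
    'Market': ['Europe', 'Europe', 'Europe', 'Africa', 'Africa'],
    'Segment': ['Consumer', 'Corporate', 'Consumer', 'Consumer', 'Home Office'],
    'Sales': [100, 200, 50, 30, 70],
})
summary = toy.groupby(['Market', 'Segment'])['Sales'].sum()

europe = summary.loc['Europe']                      # Series indexed by Segment
consumer = summary.xs('Consumer', level='Segment')  # Series indexed by Market
table = summary.unstack('Segment')                  # markets as rows, segments as columns

print(europe, consumer, table, sep='\n\n')
```

Combinations that never occur in the data (such as Africa / Corporate here) appear as `NaN` after `unstack`.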

### Basic plotting: Sales by month

In [None]:
superstore.set_index('Order Date').resample('1M')['Sales'].sum().plot();
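
To see what `resample` does without the plot, here is a minimal sketch on four invented orders. It uses the `'MS'` (month start) alias rather than `'1M'`, because pandas 2.2 and later warn that the month-end alias `'M'` is being renamed to `'ME'`. That swap is an assumption made for portability, and the only difference is that each bin is labelled with the first day of the month instead of the last.

```python
import pandas as pd

# Hypothetical orders; note there are none in March
orders = pd.DataFrame({
    'Order Date': pd.to_datetime(['2014-01-03', '2014-01-20',
                                  '2014-02-11', '2014-04-02']),
    'Sales': [100.0, 50.0, 75.0, 20.0],
})

# resample needs a datetime index; each bin covers one calendar month
monthly = orders.set_index('Order Date').resample('MS')['Sales'].sum()
print(monthly)
```

Unlike a plain `groupby`, `resample` also emits the empty month, with a sum of zero, which keeps the time axis regular when plotting.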

### A bar chart: Sales by market

This demonstrates how getting a chart to look just the way you want can get very fiddly very quickly! That's where more recent plotting libraries such as Seaborn, Plotly and Chartify can be better than the built-in Pandas plotting methods, or doing it from scratch in matplotlib.

In [None]:
# First, choose our bar colour. By default, pandas uses a different colour for each bar - nasty!
colours = sns.color_palette('tab20') 
bar_colour = colours[0]

In [None]:
with sns.axes_style('darkgrid'):
    ax = superstore.groupby('Market')['Sales'].sum().sort_values().plot.barh(color=bar_colour)

    plt.title("Total sales by market (£000)")
    ax.yaxis.label.set_visible(False)
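    # Show the tick values in thousands; a formatter avoids the fixed-label warning from set_xticklabels
    ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, pos: '{:,.0f}'.format(x / 1000)))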

### Transformations: creating new columns etc

In [None]:
# Group by ... and create profit ratio column (profit / sales)
grouped = superstore.groupby('Segment')[['Sales', 'Profit']].sum()
grouped['profit_ratio'] = grouped['Profit'] / grouped['Sales']
grouped
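
The order of operations matters here: summing first and then dividing gives the overall ratio for each segment, whereas averaging row-level ratios weights every order equally regardless of its size. A minimal sketch on three invented rows, also showing `assign` as one way to add the column in a single chained expression:

```python
import pandas as pd

# Hypothetical rows using the superstore column names
toy = pd.DataFrame({
    'Segment': ['Consumer', 'Consumer', 'Corporate'],
    'Sales': [100.0, 300.0, 200.0],
    'Profit': [10.0, 90.0, 20.0],
})

# Ratio of sums: total profit divided by total sales, per segment
grouped = (toy.groupby('Segment')[['Sales', 'Profit']].sum()
              .assign(profit_ratio=lambda d: d['Profit'] / d['Sales']))

# Mean of row-level ratios: a different, unweighted figure
row_ratio_mean = (toy['Profit'] / toy['Sales']).groupby(toy['Segment']).mean()

print(grouped, row_ratio_mean, sep='\n\n')
```

For Consumer the first approach gives 100 / 400 = 0.25, while the second gives the mean of 0.10 and 0.30, which is 0.20.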

Merges (joins) work much as they do in SQL. The main contrast is that SQL has to run on a database, whereas Pandas runs in memory.
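
As a rough illustration of a merge, the sketch below joins two small in-memory tables. Both tables, and the `Manager` column, are invented for the example and are not part of the superstore data. The call corresponds to a SQL `LEFT JOIN ... ON orders.Market = managers.Market`.

```python
import pandas as pd

# Hypothetical tables held entirely in memory
orders = pd.DataFrame({
    'Market': ['Europe', 'Africa', 'Europe', 'Asia Pacific'],
    'Sales': [100.0, 30.0, 50.0, 80.0],
})
managers = pd.DataFrame({
    'Market': ['Europe', 'Africa'],
    'Manager': ['Ana', 'Ben'],
})

# Left join: keep every order, attach a manager where one exists
merged = orders.merge(managers, on='Market', how='left')
print(merged)
```

Rows with no match in the right-hand table (Asia Pacific here) are kept, with `NaN` in the `Manager` column. Using `how='inner'` would drop them instead.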
