![Pandas Logo](https://pandas.pydata.org/static/img/pandas.svg)

# <span style="color:blue">Data Manipulation with Pandas - Review </span>

This notebook serves as a refresher of [Santiago Casas](https://www.cosmostat.org/people/santiago-casas)'s [workshop](https://github.com/santiagocasas/Tutorials/tree/pysci).

Let's review some of the main concepts discussed there:

- Main topics
    - Series
    - DataFrames
    - Indexing, Selecting, Filtering
    - Drop columns
    - Handling missing Data
    

## Let's import some Python packages

In [41]:
# Importing modules
%matplotlib inline
%load_ext lab_black

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

sns.set_context("notebook")
import matplotlib

matplotlib.rc("text", usetex=False)



# Series
A _Series_ is a one-dimensional array-like object containing an array of data and an associated array of __data labels__, called its _index_.
The data can be of any NumPy data type.

Creating a Series:

In [2]:
np.random.seed(1)

np.random.random(10)

array([4.17022005e-01, 7.20324493e-01, 1.14374817e-04, 3.02332573e-01,
       1.46755891e-01, 9.23385948e-02, 1.86260211e-01, 3.45560727e-01,
       3.96767474e-01, 5.38816734e-01])

In [3]:
series_1 = pd.Series(np.random.random(10))
series_1

0    0.419195
1    0.685220
2    0.204452
3    0.878117
4    0.027388
5    0.670468
6    0.417305
7    0.558690
8    0.140387
9    0.198101
dtype: float64

One can get a NumPy array from the Series, by typing:

In [4]:
series_1.values

array([0.41919451, 0.6852195 , 0.20445225, 0.87811744, 0.02738759,
       0.67046751, 0.4173048 , 0.55868983, 0.14038694, 0.19810149])
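
As a small aside, the same conversion can be written with the `to_numpy()` method, which the pandas documentation recommends over `.values` in recent versions. The sketch below uses a short stand-in Series with made-up values (not the random numbers above) and assumes pandas 0.24 or later.

```python
import numpy as np
import pandas as pd

# Stand-in Series with hypothetical values
s = pd.Series([0.25, 0.50, 0.75])

arr_values = s.values      # attribute used in this notebook
arr_numpy = s.to_numpy()   # method form, assumed pandas >= 0.24

print(type(arr_numpy))                        # <class 'numpy.ndarray'>
print(np.array_equal(arr_values, arr_numpy))  # True
```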

# Reindexing

One can also get the indices of each element, by typing:

In [5]:
series_1.index.values

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

One can also have a custom set of indices:

In [6]:
# Equivalent construction with the standard library (Python 3):
# import string
# alphabet = list(string.ascii_lowercase[:10])

alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
alphabet

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

In [7]:
series_2 = pd.Series(np.random.random(len(alphabet)), index=alphabet)
series_2

a    0.800745
b    0.968262
c    0.313424
d    0.692323
e    0.876389
f    0.894607
g    0.085044
h    0.039055
i    0.169830
j    0.878143
dtype: float64

One can select only a subsample of the _Series_ by passing a list of index labels:

In [8]:
series_1[[0, 1, 2]]

0    0.419195
1    0.685220
2    0.204452
dtype: float64

In [9]:
series_1[[1,3,4]]

1    0.685220
3    0.878117
4    0.027388
dtype: float64

In [10]:
series_2[['a','d','j']]

a    0.800745
d    0.692323
j    0.878143
dtype: float64
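
The selections above use plain brackets. A more explicit way to write them, sketched here on a hypothetical four-element Series, is `.loc` for label-based selection and `.iloc` for position-based selection. The values and labels are made up for illustration.

```python
import pandas as pd

# Hypothetical Series with string labels
s = pd.Series([10.0, 20.0, 30.0, 40.0], index=["a", "b", "c", "d"])

by_label = s.loc[["a", "d"]]    # select by index label
by_position = s.iloc[[0, 3]]    # select by integer position

# Label slices include the end point; position slices do not
label_slice = s.loc["a":"c"]    # labels a, b, c
position_slice = s.iloc[0:2]    # positions 0 and 1 (labels a, b)
```

Being explicit matters most when the index itself holds integers, because a bare `s[1]` could then be read either as a label or as a position.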

# Arithmetic and function Mapping

You can also evaluate numerical expressions on a Series; they are applied element by element:

In [11]:
series_1**2

0    0.175724
1    0.469526
2    0.041801
3    0.771090
4    0.000750
5    0.449527
6    0.174143
7    0.312134
8    0.019708
9    0.039244
dtype: float64

In [12]:
series_1[1]**2

0.4695257637239847

Or find values within a range, here greater than or equal to some value '__x__' and less than 0.8:

In [13]:
x = 0.5
series_1[(series_1 >= x) & (series_1 < 0.8)]

1    0.685220
5    0.670468
7    0.558690
dtype: float64
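
To make the filtering step easier to follow, the sketch below builds the boolean mask separately before using it. The Series values are hypothetical; only the comparison pattern mirrors the cell above.

```python
import pandas as pd

# Hypothetical values chosen to fall on both sides of the thresholds
s = pd.Series([0.1, 0.55, 0.7, 0.85, 0.5])
x = 0.5

# Each comparison yields a boolean Series; `&` combines them element-wise.
# The parentheses are required because `&` binds tighter than `>=` and `<`.
mask = (s >= x) & (s < 0.8)
selected = s[mask]
rejected = s[~mask]   # `~` inverts the mask

print(mask.tolist())      # [False, True, True, False, True]
print(selected.tolist())  # [0.55, 0.7, 0.5]
```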

You can apply functions to a Series (or to a DataFrame column), and save the result as a _new_ Series:

In [14]:
def exponentials(arr, basis=10.0):
    """
    Uses the array `arr` as the exponents for `basis`

    Parameters
    ----------
    arr: numpy array, list, pandas Series; shape (N,)
        array to be used as exponents of `basis`

    basis: int or float, optional (default = 10)
        number used as the basis

    Returns
    -------
    exp_arr: list, numpy array, or pandas Series; shape (N,)
        values of `basis`**`arr`, returned as the same type as `arr`
    """
    if isinstance(arr, list):
        # Lists do not support element-wise arithmetic, so loop explicitly
        return [basis**x for x in arr]
    elif isinstance(arr, (np.ndarray, pd.Series)):
        # NumPy arrays and Series apply the power element-wise
        return basis**arr
    else:
        # Raise an exception so the caller can handle the wrong type
        raise TypeError("`arr` must be a list, a NumPy array, or a pandas Series")

In [15]:
exponentials(series_1[(series_1 >= x) & (series_1 > 0.6)]).values

array([4.84417139, 7.55296438, 4.68238921])
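
For comparison, here is a minimal sketch of two other common ways to apply a function to a Series: vectorized arithmetic on the whole Series, and `Series.map` with a per-element function. The example Series is hypothetical. Both keep the original index, which a plain list comprehension would lose.

```python
import numpy as np
import pandas as pd

# Hypothetical exponents with string labels
exponents = pd.Series([0.0, 1.0, 2.0], index=["a", "b", "c"])

vectorized = 10.0 ** exponents                # arithmetic on the whole Series
mapped = exponents.map(lambda e: 10.0 ** e)   # function applied to each element

# Both results are Series that carry the labels of `exponents`
print(list(mapped.index))
```

The vectorized form is usually the faster of the two, since the loop runs inside NumPy rather than in Python.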

You can also __create__ a _Series_ using a _dictionary_ (we talked about these in __Week 4__):

In [16]:
labels_arr = ['foo', 'bar', 'baz']
data_arr   = [100, 200, 300]
dict_1     = dict(zip(labels_arr, data_arr))
dict_1

{'foo': 100, 'bar': 200, 'baz': 300}

In [17]:
series_3 = pd.Series(dict_1)
series_3

foo    100
bar    200
baz    300
dtype: int64

# Handling Missing Data

One of the most useful features of pandas is that it __can handle missing data__ quite easily. Below, the label 'qux' has no entry in the dictionary, so its value is set to NaN:

In [18]:
index = ['foo', 'bar', 'baz', 'qux']
series_4 = pd.Series(dict_1, index=index)
series_4

foo    100.0
bar    200.0
baz    300.0
qux      NaN
dtype: float64

In [19]:
pd.isnull(series_4)

foo    False
bar    False
baz    False
qux     True
dtype: bool

In [20]:
series_3

foo    100
bar    200
baz    300
dtype: int64

In [21]:
series_3 + series_4

bar    400.0
baz    600.0
foo    200.0
qux      NaN
dtype: float64
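
The sum above is aligned on labels, and any label without a value on both sides ends up as NaN. The sketch below shows a few common ways to deal with that, using two hypothetical Series whose labels only partly overlap: `add` with `fill_value`, `dropna`, and `fillna`.

```python
import pandas as pd

# Hypothetical data: 'baz' exists only in s_a, 'qux' only in s_b
s_a = pd.Series({"foo": 100, "bar": 200, "baz": 300})
s_b = pd.Series({"foo": 1.0, "bar": 2.0, "qux": 4.0})

plain_sum = s_a + s_b                      # NaN for 'baz' and 'qux'
filled_sum = s_a.add(s_b, fill_value=0)    # a label missing on one side counts as 0

dropped = plain_sum.dropna()               # keep only labels with a value
zeroed = plain_sum.fillna(0)               # replace NaN with 0

print(plain_sum.isna().sum())              # 2
```

Note that `fill_value` only helps when a value is missing on one side; if it is missing on both sides, the result is still NaN.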

A Series is powerful on its own, but __DataFrames__ are probably used the most, since they represent a _tabular data structure_ containing an ordered collection of __columns__ and __rows__.

# DataFrames

A DataFrame is a "tabular data structure" containing an _ordered collection of columns_. Each column can have a __different__ data type.

Row and column operations are treated roughly symmetrically.
One can obtain a DataFrame from a normal dictionary, or by reading a file with columns and rows.

Creating a DataFrame:

In [22]:
data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'popu' : [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = pd.DataFrame(data_1)
df_1

  state  year  popu
0    VA  2012   5.0
1    VA  2013   5.1
2    VA  2014   5.2
3    MD  2014   4.0
4    MD  2015   4.1


This DataFrame has 5 rows and 3 columns, named "_state_", "_year_", and "_popu_".

The way to __access__ a DataFrame is quite similar to that of accessing a _Series_.<br>
To access a __column__, one writes the name of the `column`, as in the following example:

In [23]:
df_1['popu']

0    5.0
1    5.1
2    5.2
3    4.0
4    4.1
Name: popu, dtype: float64

In [24]:
df_1.popu

0    5.0
1    5.1
2    5.2
3    4.0
4    4.1
Name: popu, dtype: float64
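
Both access styles return the same Series, but attribute access has limits worth knowing about. The sketch below uses a hypothetical DataFrame whose column names `count` and `unempl rate` are made up to show two cases where only brackets work.

```python
import pandas as pd

# Hypothetical columns: 'count' clashes with a DataFrame method,
# 'unempl rate' contains a space
df = pd.DataFrame(
    {"popu": [5.0, 4.0], "count": [1, 2], "unempl rate": [3.5, 4.2]}
)

same = df["popu"].equals(df.popu)   # True: both return the column

# `df.count` is the DataFrame.count method, not the column
is_method = callable(df.count)      # True
count_col = df["count"]             # brackets always return the column

# Names that are not valid Python identifiers need brackets as well
rate_col = df["unempl rate"]
```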

One can also handle __missing data__ with DataFrames.
As with Series, requested columns that are not present in the data are filled with NaNs:

In [25]:
df_2 = pd.DataFrame(data_1, columns=['year', 'state', 'popu', 'unempl'])
df_2

   year state  popu unempl
0  2012    VA   5.0    NaN
1  2013    VA   5.1    NaN
2  2014    VA   5.2    NaN
3  2014    MD   4.0    NaN
4  2015    MD   4.1    NaN


In [26]:
df_2['state']

0    VA
1    VA
2    VA
3    MD
4    MD
Name: state, dtype: object

One can __retrieve rows__ by integer position with `iloc`; the slice `1:4` returns rows 1 through 3:

In [27]:
df_2.iloc[1:4]

   year state  popu unempl
1  2013    VA   5.1    NaN
2  2014    VA   5.2    NaN
3  2014    MD   4.0    NaN
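
As a rough companion to the `iloc` slice above, this sketch contrasts position-based and label-based row selection on a reduced, hypothetical copy of the table (two columns only). It also shows selecting rows with a boolean mask.

```python
import pandas as pd

# Reduced stand-in for the table used in this notebook
df = pd.DataFrame(
    {
        "year": [2012, 2013, 2014, 2014, 2015],
        "state": ["VA", "VA", "VA", "MD", "MD"],
    }
)

single_row = df.iloc[1]       # one row, returned as a Series
row_block = df.iloc[1:4]      # positions 1, 2, 3 (end excluded)
label_block = df.loc[1:3]     # labels 1 through 3 (end included)

# Rows can also be selected with a boolean mask
md_rows = df.loc[df["state"] == "MD"]
```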


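Editing a DataFrame is quite easy. One can _assign_ a list, an array, or a Series to a column of the DataFrame. If the value is a list or an array, __its length must match the number of rows in the DataFrame__. If it is a Series, it is aligned on the index instead: rows whose labels are missing from the Series become NaN, and Series labels absent from the DataFrame (here, `5`) are ignored.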

In [28]:
unempl = pd.Series([1.0, 2.0, 10.], index=[1,3,5])
unempl

1     1.0
3     2.0
5    10.0
dtype: float64

In [29]:
df_2['unempl'] = unempl
df_2

   year state  popu  unempl
0  2012    VA   5.0     NaN
1  2013    VA   5.1     1.0
2  2014    VA   5.2     NaN
3  2014    MD   4.0     2.0
4  2015    MD   4.1     NaN


In [30]:
df_2.unempl.isnull()

0     True
1    False
2     True
3    False
4     True
Name: unempl, dtype: bool
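
The difference between assigning a Series and assigning a list can be sketched on a small hypothetical table. The column names match the notebook, but the data are cut down to three rows, and `bad` is a made-up column name used only to trigger the error.

```python
import pandas as pd

df = pd.DataFrame({"year": [2012, 2013, 2014]})

# A Series is aligned on index labels: label 5 matches no row and is ignored,
# while rows 0 and 2 have no matching label and become NaN
df["unempl"] = pd.Series([1.0, 10.0], index=[1, 5])

# A list carries no labels, so its length must equal the number of rows
df["popu"] = [5.0, 5.1, 5.2]

error_message = None
try:
    df["bad"] = [1.0, 2.0]   # two values for three rows
except ValueError as err:
    error_message = str(err)

print(df["unempl"].isna().tolist())  # [True, False, True]
```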

You can also __transpose__ a DataFrame, i.e. swap its rows and columns:

In [31]:
df_2.T

           0     1     2     3     4
year    2012  2013  2014  2014  2015
state     VA    VA    VA    MD    MD
popu       5   5.1   5.2     4   4.1
unempl   NaN     1   NaN     2   NaN


Now, let's say you want to show __only the 'year' and 'unempl' columns__.
You can do it by passing a list of column names:

In [32]:
df_2

   year state  popu  unempl
0  2012    VA   5.0     NaN
1  2013    VA   5.1     1.0
2  2014    VA   5.2     NaN
3  2014    MD   4.0     2.0
4  2015    MD   4.1     NaN


In [33]:
df_2[['year', 'unempl']]

   year  unempl
0  2012     NaN
1  2013     1.0
2  2014     NaN
3  2014     2.0
4  2015     NaN


# Dropping Entries

Let's say you only need part of the table that you have, and you want to __drop__ a column from the DataFrame.
You can do that with the `drop` method, which returns a new DataFrame and leaves the original unchanged:

In [34]:
df_2

   year state  popu  unempl
0  2012    VA   5.0     NaN
1  2013    VA   5.1     1.0
2  2014    VA   5.2     NaN
3  2014    MD   4.0     2.0
4  2015    MD   4.1     NaN


In [35]:
df_3 = df_2.drop('unempl', axis=1)  # returns a new DataFrame; df_2 is unchanged
df_3

   year state  popu
0  2012    VA   5.0
1  2013    VA   5.1
2  2014    VA   5.2
3  2014    MD   4.0
4  2015    MD   4.1


In [36]:
df_2

   year state  popu  unempl
0  2012    VA   5.0     NaN
1  2013    VA   5.1     1.0
2  2014    VA   5.2     NaN
3  2014    MD   4.0     2.0
4  2015    MD   4.1     NaN
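
As the unchanged table above suggests, `drop` does not modify the DataFrame it is called on. The sketch below, on a small hypothetical table, also shows the `columns=` keyword, which is an equivalent and arguably more readable spelling of `axis=1`.

```python
import pandas as pd

# Hypothetical two-row table with the same column names as the notebook
df = pd.DataFrame(
    {"year": [2012, 2013], "popu": [5.0, 5.1], "unempl": [1.0, 2.0]}
)

dropped_axis = df.drop("unempl", axis=1)             # form used in this notebook
dropped_kw = df.drop(columns="unempl")               # equivalent keyword form
dropped_many = df.drop(columns=["popu", "unempl"])   # several columns at once

print(df.columns.tolist())  # ['year', 'popu', 'unempl'] (original unchanged)
```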


You can also __drop certain rows__:

In [37]:
df_2

   year state  popu  unempl
0  2012    VA   5.0     NaN
1  2013    VA   5.1     1.0
2  2014    VA   5.2     NaN
3  2014    MD   4.0     2.0
4  2015    MD   4.1     NaN


In [38]:
df_4 = df_2.drop([1,2])
df_4

   year state  popu  unempl
0  2012    VA   5.0     NaN
3  2014    MD   4.0     2.0
4  2015    MD   4.1     NaN


__Look at this carefully__! The DataFrame _preserved_ the original index labels from __df_2__ (0, 3, 4) instead of renumbering the rows.

If you want to __reset__ the indices, you can do that with `reset_index`. By default, the old index is kept as a new column named `index`:

In [39]:
df_4.reset_index(inplace=True)
df_4

   index  year state  popu  unempl
0      0  2012    VA   5.0     NaN
1      3  2014    MD   4.0     2.0
2      4  2015    MD   4.1     NaN
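
If the old labels are not needed, `reset_index` accepts `drop=True`, which discards them instead of storing them in a column. A minimal sketch on a hypothetical one-column table:

```python
import pandas as pd

df = pd.DataFrame({"year": [2012, 2013, 2014, 2014, 2015]})
subset = df.drop([1, 2])                    # remaining index labels: 0, 3, 4

kept = subset.reset_index()                 # old labels stored in an 'index' column
discarded = subset.reset_index(drop=True)   # old labels thrown away

print(kept.columns.tolist())       # ['index', 'year']
print(discarded.index.tolist())    # [0, 1, 2]
```

Both calls return a new DataFrame here, so `subset` keeps its original labels unless `inplace=True` is passed as in the cell above.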


# <span style="color:green">Future of Pandas </span>
Pandas is a __great__ tool for handling data, especially comma-delimited or space-separated data. Pandas is also compatible with many other packages, like __seaborn__, __astropy__, NumPy, etc.

# <span style="color:blue">Resources </span>
- [12 Useful Pandas Techniques in Python for Data Manipulation](https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/)
- [Datacamp Pandas Tutorial](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python)
- [Top 8 resources for learning data analysis with pandas](http://www.dataschool.io/best-python-pandas-resources/)