# Introduction to Pandas

In this section of the course we will learn how to use pandas for data analysis. You can think of pandas as an extremely powerful version of Excel, with a lot more features. Work through the notebooks in this order:

* DataTypes
    * Series
    * DataFrames
* Missing Data
* Operations
* Data Analysis
    * GroupBy
    * Merging, Joining, and Concatenating
    * Operations
* Data Input and Output

# Series

The first main data type we will learn about for pandas is the Series data type. Let's import pandas and explore the Series object.

A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

Let's explore this concept through some examples:

In [1]:
import numpy as np
import pandas as pd

### Creating a Series

You can convert a list, NumPy array, or dictionary to a Series:

In [2]:
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}

Creating a Series from a **list**, using default indexing

In [3]:
pd.Series(data=my_list) 

0    10
1    20
2    30
dtype: int64

Creating a Series from a list, using defined indices

In [4]:
pd.Series(data=my_list,index=labels) 

a    10
b    20
c    30
dtype: int64

In [5]:
pd.Series(my_list,labels)

a    10
b    20
c    30
dtype: int64

Creating a Series from an **array**

In [6]:
pd.Series(arr)

0    10
1    20
2    30
dtype: int64

In [7]:
pd.Series(arr,labels)

a    10
b    20
c    30
dtype: int64

Creating a Series from a **dict**. The dictionary keys become the indices and the dictionary values become the data.

In [8]:
pd.Series(d)

a    10
b    20
c    30
dtype: int64
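
To make the idea of axis labels concrete, here is a minimal sketch of looking up the same value by label and by position. The `prices` Series and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical price data; the dict keys become the index labels.
prices = pd.Series({'apple': 1.5, 'banana': 0.25, 'cherry': 4.0})

by_label = prices.loc['banana']   # look up by index label
by_position = prices.iloc[1]      # look up by integer position (second item)

print(by_label, by_position)      # both refer to the same entry
print(prices.index.tolist())      # the labels, in insertion order
```

Using **.loc** for labels and **.iloc** for positions keeps the two kinds of lookup explicit, which matters once the labels themselves are numbers.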

### Data in a Series

A pandas Series can hold a variety of object types:

In [9]:
pd.Series(data=labels) #Values are strings

0    a
1    b
2    c
dtype: object

In [10]:
pd.Series([sum,print,len]) #Values are functions

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

**Convert the type of values in a Series**

**series.astype(new_type)** outputs a new Series with the same data converted to another data type.

Types could be **str, float, int32, int64**, etc. Sized types are passed as strings such as 'int32' or as NumPy types such as np.int32.

In [11]:
pd.Series(data=arr).astype(int)

0    10
1    20
2    30
dtype: int64
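
The cell above converts integers to integers, so nothing visibly changes. As a small sketch of conversions that do change the data type (the variable names here are only for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([10, 20, 30]))

as_float = s.astype(float)     # 10 becomes 10.0
as_text = s.astype(str)        # 10 becomes the string '10'
as_small = s.astype('int32')   # NumPy-style dtype names are passed as strings

print(as_float.dtype, as_small.dtype)
print(as_text.iloc[0] + '!')   # string concatenation now works
print(s.dtype)                 # the original Series is left untouched
```

Each call returns a new Series, so assign the result if you want to keep it.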

# DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as several Series objects put together to share the same index. DataFrames have column headers in order to uniquely reference each Series object. Let's use pandas to explore this topic!

In [12]:
import pandas as pd
import numpy as np

In [13]:
from numpy.random import randn

In [14]:
index_list  = 'A B C D E'.split() #Splits the string into a list of letters
column_list = 'W X Y Z'.split()

df = pd.DataFrame(randn(5,4), index=index_list, columns=column_list)
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 0.943895 | -0.726214 | -0.048286 | -0.282032 |
| B | -1.099737 | -0.403837 | -0.809824 | -1.578647 |
| C | -0.083766 | 2.11467 | 1.081837 | 0.326892 |
| D | -1.635787 | -0.811394 | -1.203556 | 0.191282 |
| E | -0.674214 | -1.185172 | -1.60552 | 0.005011 |
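
The "several Series sharing one index" picture can be sketched directly by building a small DataFrame from a dictionary of Series. The names `w`, `x`, and `small_df` and their values are assumptions made for this example.

```python
import pandas as pd

idx = ['A', 'B', 'C']
w = pd.Series([1.0, 2.0, 3.0], index=idx)
x = pd.Series([10.0, 20.0, 30.0], index=idx)

# Each dict key becomes a column header; the Series share one index.
small_df = pd.DataFrame({'W': w, 'X': x})
print(small_df)

# Pulling a column back out gives the same values we put in.
print((small_df['W'] == w).all())
```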


## Selection and Indexing

Let's learn the various methods to grab data from a DataFrame.

Grab a single column from the DataFrame. Notice the output is a Series object!

In [15]:
df['Z']

A   -0.282032
B   -1.578647
C    0.326892
D    0.191282
E    0.005011
Name: Z, dtype: float64

In [16]:
type(df['Z'])

pandas.core.series.Series

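Sometimes you may see columns referenced using **df.column**. This attribute-style access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so you should use **df['column']** to reference a column in a DataFrame object.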

In [18]:
df.Z

A   -0.282032
B   -1.578647
C    0.326892
D    0.191282
E    0.005011
Name: Z, dtype: float64
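
One practical reason to prefer bracket notation is that attribute access can silently pick up something other than your column. A short sketch, using hypothetical column names chosen to trigger the problem:

```python
import pandas as pd

# Hypothetical column names: 'count' clashes with a DataFrame method,
# and 'unit price' is not a valid Python identifier.
demo = pd.DataFrame({'Z': [1, 2], 'count': [3, 4], 'unit price': [5.0, 6.0]})

print(demo.Z.equals(demo['Z']))      # for a simple name both forms agree
print(callable(demo.count))          # demo.count is the DataFrame.count method
print(demo['count'].tolist())        # brackets still reach the column
print(demo['unit price'].tolist())   # a name with a space has no attribute form
```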

Select multiple columns. Notice we are using a list to index and the output is a DataFrame object.

In [19]:
df[['W','Z']] # Pass a list of column names

|   | W | Z |
|---|---|---|
| A | 0.943895 | -0.282032 |
| B | -1.099737 | -1.578647 |
| C | -0.083766 | 0.326892 |
| D | -1.635787 | 0.191282 |
| E | -0.674214 | 0.005011 |


**Creating a new column:**

Assign a Series object or array with the same number of items as the DataFrame has rows.

In [20]:
df['new'] = df['W'] + df['Y']
df

|   | W | X | Y | Z | new |
|---|---|---|---|---|---|
| A | 0.943895 | -0.726214 | -0.048286 | -0.282032 | 0.895609 |
| B | -1.099737 | -0.403837 | -0.809824 | -1.578647 | -1.909561 |
| C | -0.083766 | 2.11467 | 1.081837 | 0.326892 | 0.998072 |
| D | -1.635787 | -0.811394 | -1.203556 | 0.191282 | -2.839343 |
| E | -0.674214 | -1.185172 | -1.60552 | 0.005011 | -2.279734 |
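
A new column can also come from a plain array, as long as its length matches the number of rows. The sketch below uses a made-up `demo` frame and shows what happens when the lengths disagree.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'W': [1.0, 2.0, 3.0], 'Y': [0.5, 0.5, 0.5]})

demo['total'] = demo['W'] + demo['Y']   # from a Series expression
demo['rank'] = np.array([3, 2, 1])      # from an array of matching length

# Two values cannot fill three rows, so pandas refuses the assignment.
try:
    demo['bad'] = [1, 2]
except ValueError as err:
    print('ValueError:', err)

print(demo)
```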


**Removing Rows and Columns**

**df.drop('row')** returns a new DataFrame with that row removed. The original df object does not change.

**df.drop('column', axis=1)** returns a new DataFrame with that column removed. With axis=0 (the default), drop looks for the label among the rows; with axis=1, it looks among the columns. Passing a column name with axis=0, or a row label with axis=1, raises a KeyError.

**df.drop('row', inplace=True)** removes the row from the current df object and returns None.

In [21]:
df2 = df.drop('new',axis=1)
df2

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 0.943895 | -0.726214 | -0.048286 | -0.282032 |
| B | -1.099737 | -0.403837 | -0.809824 | -1.578647 |
| C | -0.083766 | 2.11467 | 1.081837 | 0.326892 |
| D | -1.635787 | -0.811394 | -1.203556 | 0.191282 |
| E | -0.674214 | -1.185172 | -1.60552 | 0.005011 |


Original df object did not change.

In [22]:
df 

|   | W | X | Y | Z | new |
|---|---|---|---|---|---|
| A | 0.943895 | -0.726214 | -0.048286 | -0.282032 | 0.895609 |
| B | -1.099737 | -0.403837 | -0.809824 | -1.578647 | -1.909561 |
| C | -0.083766 | 2.11467 | 1.081837 | 0.326892 | 0.998072 |
| D | -1.635787 | -0.811394 | -1.203556 | 0.191282 | -2.839343 |
| E | -0.674214 | -1.185172 | -1.60552 | 0.005011 | -2.279734 |


In [23]:
df.drop('new', axis=1, inplace=True)
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 0.943895 | -0.726214 | -0.048286 | -0.282032 |
| B | -1.099737 | -0.403837 | -0.809824 | -1.578647 |
| C | -0.083766 | 2.11467 | 1.081837 | 0.326892 |
| D | -1.635787 | -0.811394 | -1.203556 | 0.191282 |
| E | -0.674214 | -1.185172 | -1.60552 | 0.005011 |
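
The three forms of **drop** can be compared side by side on a small frame. This is only a sketch; `demo` and its values are invented for the example.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2, 3], 'X': [4, 5, 6]}, index=['A', 'B', 'C'])

without_row = demo.drop('A')           # axis=0 is the default: drop a row
without_col = demo.drop('X', axis=1)   # axis=1: drop a column

# 'X' is a column name, not a row label, so looking for it on axis=0 fails.
try:
    demo.drop('X', axis=0)
except KeyError as err:
    print('KeyError:', err)

# With inplace=True the call returns None and demo itself is modified.
result = demo.drop('C', inplace=True)
print(result)
print(demo)
```

Because the in-place form returns None, writing `demo = demo.drop('C', inplace=True)` would replace your DataFrame with None, which is a common source of confusion.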


**Transpose** outputs a DataFrame with columns and rows switched.

In [24]:
df.transpose()

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| W | 0.943895 | -1.099737 | -0.083766 | -1.635787 | -0.674214 |
| X | -0.726214 | -0.403837 | 2.11467 | -0.811394 | -1.185172 |
| Y | -0.048286 | -0.809824 | 1.081837 | -1.203556 | -1.60552 |
| Z | -0.282032 | -1.578647 | 0.326892 | 0.191282 | 0.005011 |


### Selecting Rows

**df.loc['row']** selects a row by index label. Notice the output is a Series object with the column names as its indices!

**df.iloc[number]** selects a row by integer position instead of by label.

In [25]:
df['W']      #Select a column
df.loc['A']  #Select a row

W    0.943895
X   -0.726214
Y   -0.048286
Z   -0.282032
Name: A, dtype: float64

In [26]:
df.iloc[2]

W   -0.083766
X    2.114670
Y    1.081837
Z    0.326892
Name: C, dtype: float64
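
The difference between label and position is easiest to see when the index labels are themselves integers. A minimal sketch, assuming a made-up frame whose labels are 10, 20, and 30:

```python
import pandas as pd

# Hypothetical frame whose integer labels do not match the row positions.
demo = pd.DataFrame({'W': [1.0, 2.0, 3.0]}, index=[10, 20, 30])

first_by_position = demo.iloc[0]   # first row, whatever its label is
row_labelled_10 = demo.loc[10]     # the row whose label is 10

print(first_by_position.name, row_labelled_10.name)

# There is no eleventh row, so position 10 is out of bounds.
try:
    demo.iloc[10]
except IndexError as err:
    print('IndexError:', err)
```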

### Selecting a subset of rows and columns

**df.loc['row','column']** outputs the value at that location.

**df.loc[['row1','row2'],['column1','column2']]** if you provide a list of rows and a list of columns, the output is a new DataFrame that contains only the rows and columns selected, in the same order as the input lists.

In [27]:
df.loc['B','Y']

-0.8098244925473412

In [28]:
df.loc[['C','B'],['Y','W']]

|   | Y | W |
|---|---|---|
| C | 1.081837 | -0.083766 |
| B | -0.809824 | -1.099737 |


### Conditional Selection

An important feature of pandas is conditional selection using bracket notation. Instead of selecting values in a DataFrame based on location, we will select values based on a condition.

In [29]:
df>0 # (Our condition) All values greater than zero. Returns a True/False DataFrame

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | True | False | False | False |
| B | False | False | False | False |
| C | False | True | True | True |
| D | False | False | False | True |
| E | False | False | False | True |


In [30]:
df[df>0] # Indexing using Our Condition

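|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 0.943895 | NaN | NaN | NaN |
| B | NaN | NaN | NaN | NaN |
| C | NaN | 2.11467 | 1.081837 | 0.326892 |
| D | NaN | NaN | NaN | 0.191282 |
| E | NaN | NaN | NaN | 0.005011 |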


### Conditioning based on values in a column

In [31]:
df['W']>0 # (Our Condition) It returns a True/False Series object.

A     True
B    False
C    False
D    False
E    False
Name: W, dtype: bool

In [32]:
df[df['W']>0] # Indexing using Our Condition. 

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 0.943895 | -0.726214 | -0.048286 | -0.282032 |


Notice there are fewer rows: all rows where Our Condition was False are absent from the output DataFrame. If no row satisfies the condition, the result is a DataFrame with no rows.

Since the output is a DataFrame we can perform operations on it like any other DataFrame.

In [33]:
df[df['W']>0]['Y'] # Indexing using Our Condition and observing a single column.

A   -0.048286
Name: Y, dtype: float64

In [34]:
df[df['W']>0][['Y','X']]

|   | Y | X |
|---|---|---|
| A | -0.048286 | -0.726214 |
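
The chained form above selects rows first and columns second. As an alternative sketch, **.loc** accepts the condition and the column list in one call; the `demo` frame below uses rounded, made-up values.

```python
import pandas as pd

demo = pd.DataFrame(
    {'W': [0.9, -1.1, -0.1], 'X': [-0.7, -0.4, 2.1], 'Y': [-0.05, -0.8, 1.1]},
    index=['A', 'B', 'C'],
)

chained = demo[demo['W'] > 0][['Y', 'X']]        # two selection steps
one_step = demo.loc[demo['W'] > 0, ['Y', 'X']]   # rows and columns together

print(one_step)
print(chained.equals(one_step))   # both forms select the same data
```

For reading data the two forms give the same result; the single **.loc** call is generally the safer habit when you later want to assign to the selection.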


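Python's **and** and **or** keywords do not work on a True/False Series because they cannot compare it element by element. When applying multiple conditions, combine them with | (or) and & (and) instead, wrapping each condition in parentheses: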

In [35]:
df[(df['W']<0) & (df['Y'] > 0)][['Y','W']]

|   | Y | W |
|---|---|---|
| C | 1.081837 | -0.083766 |
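
Here is a short sketch of both operators, plus the error you get if you reach for the **and** keyword. The `demo` values are rounded stand-ins, not the random numbers shown above.

```python
import pandas as pd

demo = pd.DataFrame(
    {'W': [0.9, -1.1, -0.1], 'Y': [-0.05, -0.8, 1.1]},
    index=['A', 'B', 'C'],
)

both = demo[(demo['W'] < 0) & (demo['Y'] > 0)]     # AND: every condition holds
either = demo[(demo['W'] > 0) | (demo['Y'] > 0)]   # OR: at least one holds

print(both.index.tolist())
print(either.index.tolist())

# The keyword 'and' asks for a single True/False from a whole Series.
try:
    demo[(demo['W'] < 0) and (demo['Y'] > 0)]
except ValueError as err:
    print('ValueError:', err)
```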


### More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

**df.reset_index()** returns a DataFrame whose index is reset to the default 0, 1, ..., n-1. A new column 'index' is generated containing the original indices.

In [36]:
df.reset_index()

|   | index | W | X | Y | Z |
|---|---|---|---|---|---|
| 0 | A | 0.943895 | -0.726214 | -0.048286 | -0.282032 |
| 1 | B | -1.099737 | -0.403837 | -0.809824 | -1.578647 |
| 2 | C | -0.083766 | 2.11467 | 1.081837 | 0.326892 |
| 3 | D | -1.635787 | -0.811394 | -1.203556 | 0.191282 |
| 4 | E | -0.674214 | -1.185172 | -1.60552 | 0.005011 |
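
If you do not want the old labels kept as a column, **reset_index** takes a **drop=True** argument. A minimal sketch on a made-up frame:

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2, 3]}, index=['A', 'B', 'C'])

kept = demo.reset_index()                 # old labels move into an 'index' column
discarded = demo.reset_index(drop=True)   # old labels are thrown away

print(kept.columns.tolist())
print(discarded.columns.tolist())
print(demo.index.tolist())                # the original frame is unchanged
```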


Setting new indices on your DataFrame:

**df.set_index('column')** outputs a new DataFrame that uses an existing column as its indices.

**df.set_index('column', inplace=True)** replaces the indices of the current DataFrame with an existing column and returns None. The previous indices are discarded.

In [37]:
newind = 'CA NY WY OR CO'.split() # List of state abbreviations
df['States'] = newind # Add list to new column in DataFrame
df.set_index('States',inplace=True)
df

| States | W | X | Y | Z |
|---|---|---|---|---|
| CA | 0.943895 | -0.726214 | -0.048286 | -0.282032 |
| NY | -1.099737 | -0.403837 | -0.809824 | -1.578647 |
| WY | -0.083766 | 2.11467 | 1.081837 | 0.326892 |
| OR | -1.635787 | -0.811394 | -1.203556 | 0.191282 |
| CO | -0.674214 | -1.185172 | -1.60552 | 0.005011 |
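
The non-inplace form of **set_index** can be sketched the same way. The `demo` frame here is a made-up, smaller version of the one above.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2, 3], 'States': ['CA', 'NY', 'WY']})

indexed = demo.set_index('States')   # returns a new DataFrame

print(indexed.index.name)            # the column name becomes the index name
print(indexed.columns.tolist())      # 'States' is no longer a regular column
print(demo.columns.tolist())         # the original frame keeps both columns
print(indexed.loc['NY', 'W'])        # rows can now be looked up by state
```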


**Rename rows or columns using df.rename()**

**df.rename(columns = {"current_name" : "new_name"})** returns a new DataFrame with the column renamed.

**df.rename(index = {"current_name1" : "new_name1"}, inplace=True)** renames a row label in the current df object.


In [38]:
df.rename(columns = {"Z" : "A"})

| States | W | X | Y | A |
|---|---|---|---|---|
| CA | 0.943895 | -0.726214 | -0.048286 | -0.282032 |
| NY | -1.099737 | -0.403837 | -0.809824 | -1.578647 |
| WY | -0.083766 | 2.11467 | 1.081837 | 0.326892 |
| OR | -1.635787 | -0.811394 | -1.203556 | 0.191282 |
| CO | -0.674214 | -1.185172 | -1.60552 | 0.005011 |
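
Columns and row labels can be renamed in the same call. The sketch below uses an invented two-row frame and also shows that, by default, names that match nothing are simply ignored.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2], 'Z': [3, 4]}, index=['CA', 'NY'])

# Rename one column and one row label; a new DataFrame is returned.
renamed = demo.rename(columns={'Z': 'A'}, index={'CA': 'California'})
print(renamed)
print(demo.columns.tolist())   # the original frame is unchanged

# 'missing' is not a column, so nothing is renamed and no error is raised.
unchanged = demo.rename(columns={'missing': 'X'})
print(unchanged.columns.tolist())
```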


Note: It is important to understand the outputs of the functions. As you become more comfortable you will begin calling multiple functions in a row. Knowing whether to expect a Series, a DataFrame, or a single value as an output is important. If you aren't careful, you may run into errors.
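
When in doubt, checking the type of an intermediate result is a quick way to see what you are holding. A small sketch on a made-up frame:

```python
import pandas as pd

demo = pd.DataFrame({'W': [1.0, 2.0], 'X': [3.0, 4.0]}, index=['A', 'B'])

one_column = demo['W']         # one column name
column_list = demo[['W']]      # a list of column names, even a list of one
one_row = demo.loc['A']        # one row label
one_value = demo.loc['A', 'W'] # a row label and a column label

print(type(one_column).__name__)
print(type(column_list).__name__)
print(type(one_row).__name__)
print(one_value)
```

The first and third results are Series, the second is a DataFrame, and the last is a single number.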

# Great Job!