# More Advanced Pandas & Intro to Matplotlib

In [1]:
#load libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Reviewing .iloc/.loc/.ix 

*.iloc* = iloc selects by integer position (EXCLUDES the stop value of a slice)<br> </br>
*.loc* = loc selects by index label (includes the stop value of a slice)<br> </br>
*.ix* = ix worked on mixed label/position types (included the stop value); it was deprecated in pandas 0.20 and removed in pandas 1.0, so use .loc or .iloc instead<br> </br>

check: https://www.youtube.com/watch?v=xvpNA7bC8cs&t=1166s
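
To make the slicing difference concrete, here is a minimal sketch. The toy DataFrame, its labels, and the variable names are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame with string labels as the index
df = pd.DataFrame({'score': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

# .iloc slices by integer position: the stop position (2) is excluded
by_position = df.iloc[0:2]

# .loc slices by label: the stop label ('c') is included
by_label = df.loc['a':'c']

print(by_position)
print(by_label)
```

The position-based slice returns two rows, while the label-based slice returns three because the stop label is part of the result.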

# Working with missing data

Pandas uses NaN (not a number) as its default marker for missing data, that is, values that are absent from a dataset. Only real NaN/None values are recognized as missing; a placeholder such as the string 'NaN' is ordinary text, so we must explicitly tell Pandas which values to treat as missing.

In [2]:
df2 = pd.DataFrame({'a':[1,2,3,4,5],'b':[10,'NaN',30,40,50],'c':[100,200,'NaN','NaN',500],'d':['NaN','NaN','NaN',4000,5000]})

In [3]:
# Count the number of missing values in each row
df2['Missing_Values'] = df2.isnull().sum(axis=1)
print(df2.isnull())

## Why didn't this work? The 'NaN' entries are plain strings, so isnull() does not flag them

       a      b      c      d Missing_Values
0  False  False  False  False          False
1  False  False  False  False          False
2  False  False  False  False          False
3  False  False  False  False          False
4  False  False  False  False          False


In [4]:
## Missing values need to be specified explicitly!
# We can use NumPy's np.nan to do so

df2 = df2.replace('NaN',np.nan)
print(df2.isnull())

       a      b      c      d Missing_Values
0  False  False  False   True          False
1  False   True  False   True          False
2  False  False   True   True          False
3  False  False   True  False          False
4  False  False  False  False          False


In [5]:
# recompute the Missing_Values column: sum the number of missing values across each row
df2['Missing_Values'] = df2.isnull().sum(axis=1)
print(df2)

   a     b      c       d  Missing_Values
0  1  10.0  100.0     NaN               1
1  2   NaN  200.0     NaN               2
2  3  30.0    NaN     NaN               2
3  4  40.0    NaN  4000.0               1
4  5  50.0  500.0  5000.0               0
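
Another way to tell Pandas what counts as missing is to declare the markers when the data is loaded. The sketch below assumes a small in-memory CSV in which the word 'missing' is the placeholder; the CSV text and variable names are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text in which 'missing' stands for an absent value
csv_text = "a,b\n1,10\n2,missing\n3,30\n"

# na_values lists extra strings that read_csv should convert to NaN
df_csv = pd.read_csv(io.StringIO(csv_text), na_values=['missing'])

# Count the missing values in each column
missing_per_column = df_csv.isnull().sum()
print(missing_per_column)
```

This avoids a separate replace step when the placeholder is known before loading.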


# More Advanced Data Wrangling in Pandas


Pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational functionality for join / merge-type operations.

# <font color='red'> Concatenating Dataframes </font>


**Concat**: The concat function in Pandas appends either rows or columns from one DataFrame to another. When we concatenate DataFrames we simply attach them to each other, stacking them either vertically (axis=0, the default) or side by side (axis=1).


### <font color='blue'> Concat Example </font> 

In [6]:
raw_data = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', 'last_name'])
df_a

  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          5     Ayoung   Atiches


In [7]:
raw_data2 = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
df_b = pd.DataFrame(raw_data2, columns = ['subject_id', 'first_name', 'last_name'])
df_b

  subject_id first_name last_name
0          4      Billy    Bonder
1          5      Brian     Black
2          6       Bran   Balwner
3          7      Bryce     Brice
4          8      Betty    Btisan


In [8]:
raw_data3 = {
        'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
df_n = pd.DataFrame(raw_data3, columns = ['subject_id','test_id'])
df_n

  subject_id  test_id
0          1       51
1          2       15
2          3       15
3          4       61
4          5       16
5          7       14
6          8       15
7          9        1
8         10       61
9         11       16


In [9]:
# concatenate 2 dataframes along rows
df_byrow = pd.concat([df_a, df_b])

df_byrow

  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          5     Ayoung   Atiches
0          4      Billy    Bonder
1          5      Brian     Black
2          6       Bran   Balwner
3          7      Bryce     Brice
4          8      Betty    Btisan
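
Note that stacking rows keeps each frame's original index, so the labels 0 to 4 appear twice in the result. If a fresh index is wanted, concat accepts ignore_index=True. A minimal sketch with two small made-up frames (the names top and bottom are illustrative only):

```python
import pandas as pd

# Two hypothetical frames that share the same columns
top = pd.DataFrame({'subject_id': ['1', '2'], 'first_name': ['Alex', 'Amy']})
bottom = pd.DataFrame({'subject_id': ['4', '5'], 'first_name': ['Billy', 'Brian']})

# ignore_index=True discards the old row labels and renumbers from 0
stacked = pd.concat([top, bottom], ignore_index=True)
print(stacked)
```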


In [10]:
# join two dataframes along columns
df_bycol = pd.concat([df_a, df_b], axis=1)
df_bycol

  subject_id first_name last_name subject_id first_name last_name
0          1       Alex  Anderson          4      Billy    Bonder
1          2        Amy  Ackerman          5      Brian     Black
2          3      Allen       Ali          6       Bran   Balwner
3          4      Alice      Aoni          7      Bryce     Brice
4          5     Ayoung   Atiches          8      Betty    Btisan


## <font color='red'> Merging Dataframes </font>

**Merge**:  Another way to combine DataFrames is to use columns in each dataset that contain common values (a common unique id). Combining DataFrames using a common field is called “joining”. The columns containing the common values are called “join key(s)”. Joining DataFrames in this way is often useful when one DataFrame is a “lookup table” containing additional data that we want to include in the other.

<img src='initialiDF.jpg'>

<img src="merge_PD.jpg">

In [29]:
# merge example: a right join on subject_id keeps every row of df_b
pd.merge(df_a, df_b, on='subject_id', how='right')

  subject_id first_name_x last_name_x first_name_y last_name_y
0          4        Alice        Aoni        Billy      Bonder
1          5       Ayoung     Atiches        Brian       Black
2          6          NaN         NaN         Bran     Balwner
3          7          NaN         NaN        Bryce       Brice
4          8          NaN         NaN        Betty      Btisan
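
The "lookup table" use case described above can be sketched as follows. The two small frames are cut-down, illustrative stand-ins (people and tests are hypothetical names), and the choice of left and inner joins is only meant to show how the how argument changes which rows survive.

```python
import pandas as pd

# Hypothetical main table and lookup table sharing the key 'subject_id'
people = pd.DataFrame({'subject_id': ['1', '2', '3'],
                       'first_name': ['Alex', 'Amy', 'Allen']})
tests = pd.DataFrame({'subject_id': ['1', '2', '9'],
                      'test_id': [51, 15, 1]})

# Left join: keep every row of people; unmatched keys get NaN for test_id
left_joined = pd.merge(people, tests, on='subject_id', how='left')

# Inner join: keep only the keys present in both tables
inner_joined = pd.merge(people, tests, on='subject_id', how='inner')

print(left_joined)
print(inner_joined)
```

Subject '3' has no entry in the lookup table, so it appears with a missing test_id in the left join and is dropped from the inner join.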


# Groupby

1. Groupby splits the data into groups based on the values of a variable of your choice
2. A GroupBy object's .groups attribute is a dictionary whose keys are the unique group names and whose values are the axis labels belonging to each group
3. The .groups attribute lets us look up the members of each group using the same label-based access we've seen before with Pandas
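
A minimal sketch of accessing a GroupBy object, using a four-row toy frame rather than the full drinks dataset (the frame and the variable names toy, grouped, and europe are illustrative assumptions):

```python
import pandas as pd

# Hypothetical four-country frame indexed by country name
toy = pd.DataFrame(
    {'continent': ['Asia', 'Europe', 'Africa', 'Europe'],
     'beer_servings': [0, 89, 25, 245]},
    index=['Afghanistan', 'Albania', 'Algeria', 'Andorra'])

grouped = toy.groupby('continent')

# .groups maps each group name to the index labels of its rows
print(grouped.groups)

# get_group returns the rows of one group as a DataFrame
europe = grouped.get_group('Europe')
print(europe)

# Iterating yields (group name, sub-DataFrame) pairs
for name, frame in grouped:
    print(name, len(frame))
```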

In [11]:
drinks = pd.read_csv('http://bit.ly/drinksbycountry',index_col='country')

In [12]:
#inspect the dataframe
drinks.head()

             beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol continent
country
Afghanistan              0                0              0                           0.0      Asia
Albania                 89              132             54                           4.9    Europe
Algeria                 25                0             14                           0.7    Africa
Andorra                245              138            312                          12.4    Europe
Angola                 217               57             45                           5.9    Africa


In [13]:
# using .groupby function to see the beer serving mean by continent
drinks.groupby('continent').beer_servings.mean()

continent
Africa            61.471698
Asia              37.045455
Europe           193.777778
North America    145.434783
Oceania           89.687500
South America    175.083333
Name: beer_servings, dtype: float64

In [14]:
# max number of beer servings by continent
drinks.groupby('continent').beer_servings.max()

continent
Africa           376
Asia             247
Europe           361
North America    285
Oceania          306
South America    333
Name: beer_servings, dtype: int64

In [15]:
#Aggregate findings
drinks.groupby('continent').beer_servings.agg(['count', 'min', 'max'])

               count  min  max
continent
Africa            53    0  376
Asia              44    0  247
Europe            45    0  361
North America     23    1  285
Oceania           16    0  306
South America     12   93  333


In [23]:
##Accessing groupby objects



# MatplotLib 

Matplotlib is a 2D plotting library for Python that was designed to closely resemble MATLAB:

1. Allows for access to object properties that can be modified

**Check out some examples: https://matplotlib.org/examples/**

<img src='figure_axes_axis_labeled.png'>

*Matplotlib* works using a hierarchy of containers that are all adjustable:
   1. **Figure** is the topmost container: it is the entire page for your plot
      -- it can contain multiple plots (Axes)
   2. Plotting is mostly done on an **Axes** container, typically created via subplot:
      -- Each Axes object has access to many other plot controls (axis, ticks, labels, title)
 
 *So in general we create plots by specifying each of the containers (topmost first) and then creating the graphs* <br> </br>***(not necessary for basic plots)***
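
The container hierarchy can be made explicit by creating the Figure and Axes first and then plotting on the Axes. This is a minimal sketch using plt.subplots(); the data and label text are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Create the topmost container (Figure) and one Axes inside it
fig, ax = plt.subplots()

# Plot on the Axes; plot() returns a list of line objects
x_vals = np.linspace(0, 1, 50)
line, = ax.plot(x_vals, x_vals ** 2)

# Adjust properties through the Axes container
ax.set_xlabel('x')
ax.set_ylabel('x squared')
ax.set_title('One Axes inside one Figure')
```

Holding on to fig and ax makes it clear which container each later adjustment applies to, which helps once a figure has more than one plot.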

In [None]:
# This code will embed our plots inside this notebook 
%matplotlib inline

#load the specific sub library we are interested in
import matplotlib.pyplot as plt

In [None]:
#Basic plot Example 1
#Create values incremented by .25 by using NumPy
x2 = np.arange(0,1.25,.25) 
plt.plot(x2,x2**2)

In [None]:
#Basic plot Example 2 with more than 1 line
#Create some new fake data to graph
X = np.linspace(-np.pi,np.pi,256,endpoint=True)
C = np.cos(X)
S = np.sin(X)

#Create a basic plot
plt.plot(X,S)
#plt.show()

#While the same figure is still active, another plt.plot call adds a new line to it
plt.plot(X,C)

In [None]:
# Another Example of Updating the properties of a line object;
# With new data!
x = np.arange(0,1.0,0.01) # values 0 to 1 in steps of .01
y1 = np.sin(2*np.pi*x)
y2 = np.sin(4*np.pi*x)

# note that the two lines work in one call because of same x
lines = plt.plot(x, y1, x, y2)

# The setp() function operates on a single instance or a list of instances.
# Here it changes the line width and color of both lines in the plot
plt.setp(lines, linewidth=2, color='b')

### Adjusting the x and y tick labels
> <div style="text-align: right"> *Use  plt.xticks & plt.yticks*</div>

In [None]:
#Example changing the axis tick labels (note the tick positions and labels have equal lengths)
x = [1, 2, 3, 4]
y = [1, 4, 9, 6]
z = [1, 8,16,24]

# create a list of labels
xlabs = ['Mark','Himanshu','Samson','Danny']
ylabs = ['Man','Hombre','Bro','Muchacho']
# set the plot up
plt.plot(x,y,'go')
#add the labels
plt.xticks(x,xlabs,rotation='vertical')
# ylab adjusted
plt.yticks(y,ylabs)

## Add sub-plots to a figure

To add a subplot we must pass several arguments to the plt.subplot() call:

using:
<div style="text-align: left">***plt.subplot(row,col,plot_num)***</div>
<div style="text-align: center">What does that mean??</div>
<div style="text-align: center">
1. Number of rows (**1**)
2. Number of columns (**1**)
3. Plot number (**1**)
</div> 
<br>  </br>
<br>  </br>
<div style="text-align: right">*let's look at this more closely* </div>


<img src='plot_subplot-grid_1.png'>

In [None]:
#Adjust the figures and add new plots to a figure

# The first figure
plt.figure(1) 

## REMINDER USAGE: plt.subplot(nrows, ncols, plot_number)

# The first subplot of figure 1
plt.subplot(211)
#Our first plot
plt.plot(x,y)
# The SECOND subplot of figure1
plt.subplot(212) # <---
# Our SECOND plot
plt.plot(x,z,color='r')

# Add a title above the first subplot
plt.figure(1)
plt.subplot(211) #<---
plt.title('WOAH 2 separate graphs!!')
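
To see how the plot number maps onto a larger grid, here is a small sketch that fills a 2 x 2 grid and titles each subplot with its own number. The grid size and the names fig and axes_in_order are illustrative choices, not from the original notebook.

```python
import matplotlib.pyplot as plt

# Start a new figure so the subplots below belong to it
fig = plt.figure()

axes_in_order = []
for plot_number in range(1, 5):
    # plt.subplot(nrows, ncols, plot_number): numbering starts at 1
    # in the top-left corner and runs left to right, then top to bottom
    ax = plt.subplot(2, 2, plot_number)
    ax.set_title(f'plot_number = {plot_number}')
    axes_in_order.append(ax)

# Space the subplots so the titles do not overlap
fig.tight_layout()
```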

# Check out the MatplotLib Docs for more advanced graphs (ie histograms etc)

# Pandas and Matplotlib together

#### Pandas dataframes have access to the .plot method for quick plotting

In [None]:
data = drinks.groupby('continent').mean()
data.plot(kind='barh')
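
Because the .plot method returns a Matplotlib Axes object, the result can be adjusted with the same Axes methods used elsewhere. A minimal sketch with a three-row toy frame (the frame, the name servings, and the label text are assumptions for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical small frame: one numeric column indexed by country
servings = pd.DataFrame({'beer_servings': [89, 25, 245]},
                        index=['Albania', 'Algeria', 'Andorra'])

# DataFrame.plot returns the Axes it drew on
ax = servings.plot(kind='barh')

# Customize the plot through the returned Axes
ax.set_xlabel('beer servings')
ax.set_title('Beer servings by country')
```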