# Numpy and Pandas 

## Goals

- Introduce the numpy and pandas libraries.

Learning objectives:

- Numpy arrays and their mathematical abilities
- Importing data into pandas using CSVs
- Slicing and filtering pandas dataframes
- Statistics with pandas



- Numpy and pandas are two common Python libraries used for statistical analysis, data munging/wrangling/transformation, and other mathematical purposes.

- To put it simply, they are your connection to the data. Pandas is the most important tool because it is where you'll spend most of your time and effort.


## Numpy

Numpy has a wide ecosystem of functions and uses, but for the purposes of this course we will focus on arrays, numpy's version of a list.

In [4]:
# Import library
import numpy as np

In [38]:
#Let's turn a list into an array

l = [3,2,6,7,9,1,2]

array = np.array(l)

In [39]:
#Call it
array

array([3, 2, 6, 7, 9, 1, 2])

<b>How arrays differ from lists</b>

In [40]:
#Does this code work?

l + 3

TypeError: can only concatenate list (not "int") to list

In [13]:
#What about this?
array + 3

array([ 6,  5,  9, 10, 12,  4,  5])

In [29]:
#Multiply l by 2 
l * 2

[3, 2, 6, 7, 9, 1, 2, 3, 2, 6, 7, 9, 1, 2]

In [30]:
#Multiply array by 2 
array * 2

array([ 6,  4, 12, 14, 18,  2,  4])
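The contrast between lists and arrays can be summarized in one short sketch. The names `values` and `arr` are illustrative, not part of the notebook; the point is that arithmetic on a list either fails or concatenates, while arithmetic on an array is applied element by element.

```python
import numpy as np

values = [3, 2, 6, 7, 9, 1, 2]   # illustrative list
arr = np.array(values)

# With a plain list, element-wise math needs an explicit loop or comprehension
list_plus_three = [v + 3 for v in values]

# With an array, the scalar is applied to every element
arr_plus_three = arr + 3

# Two arrays of the same length combine element by element...
arr_sum = arr + arr
# ...while two lists are concatenated into a longer list
list_concat = values + values

print(list_plus_three)
print(arr_plus_three)
print(arr_sum)
print(len(list_concat))
```

If the goal is math on every element, an array is usually the more direct tool; a list is better suited to holding a mixed collection of items.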

<b>Numpy arrays have mathematical abilities that lists don't have, which makes them easier to use</b>

In [33]:
#Mean value
array.mean()

4.2857142857142856

In [19]:
#Maximum value
array.max()

9

In [20]:
#Minimum value
array.min()

1

In [22]:
#Sum all values
array.sum()

30

In [23]:
#Find standard deviation
array.std()

2.8139593719417442
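One detail worth knowing about the standard deviation: numpy's `std()` divides by n by default (the population formula), while pandas divides by n - 1 (the sample formula). The sketch below compares the two on an illustrative seven-value list; `values`, `population_std`, `sample_std`, and `series_std` are names chosen for this example.

```python
import numpy as np
import pandas as pd

values = [3, 2, 6, 7, 9, 1, 2]   # illustrative data
arr = np.array(values)

population_std = arr.std()             # numpy default: ddof=0, divides by n
sample_std = arr.std(ddof=1)           # ddof=1 divides by n - 1
series_std = pd.Series(values).std()   # pandas default: ddof=1

print(population_std)
print(sample_std)
print(series_std)
```

The two numpy results differ, and the pandas result matches the `ddof=1` version, so it helps to check which convention a result uses before comparing numbers across the two libraries.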

In [None]:
#What happens when you do this?
dir(array)

You can also call certain functions from numpy itself.

In [24]:
#Median
np.median(array)

3.0

In [34]:
#Square
np.square(array)

array([ 9,  4, 36, 49, 81,  1,  4])

In [35]:
#Square root
np.sqrt(array)

array([ 1.73205081,  1.41421356,  2.44948974,  2.64575131,  3.        ,
        1.        ,  1.41421356])

In [None]:
# Absolute value (arrays have no .abs() method, so use the numpy function)
np.abs(array)

<b>Arrays can also be multi-dimensional</b>

In [47]:
#Make a two-dimensional numpy array with the arange and reshape functions

np.arange(16)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

In [48]:
arr_2d = np.arange(16).reshape(4,4)
arr_2d

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
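A short sketch of how `reshape` relates to an array's `shape` may help here. The names `flat`, `grid`, and `two_rows` are illustrative. The total number of elements has to stay the same, and `-1` asks numpy to work out one dimension on its own.

```python
import numpy as np

flat = np.arange(16)          # 16 values in one dimension
grid = flat.reshape(4, 4)     # same 16 values arranged as 4 rows x 4 columns

print(flat.shape)
print(grid.shape)
print(grid.ndim)

# -1 tells numpy to infer that dimension from the total number of elements
two_rows = flat.reshape(2, -1)
print(two_rows.shape)

# The element count must match, otherwise reshape raises a ValueError
try:
    flat.reshape(3, 5)
except ValueError as err:
    print("reshape failed:", err)
```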

<b>Slicing a two-dimensional array</b>

In [52]:
#Slice rows
arr_2d[:3]

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [54]:
#Slice columns 
arr_2d[:, 1:]

array([[ 1,  2,  3],
       [ 5,  6,  7],
       [ 9, 10, 11],
       [13, 14, 15]])

In [55]:
#Slice both rows and columns
arr_2d[2: , :3]

array([[ 8,  9, 10],
       [12, 13, 14]])

In [56]:
#Slice specific value
arr_2d[1,2]

6
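The same `[rows, columns]` idea carries over to the math methods through the `axis` argument. This is a minimal sketch on the same 4x4 array; the result names (`col_sums`, `row_means`, `last_col`, `top_left`) are chosen for illustration.

```python
import numpy as np

arr_2d = np.arange(16).reshape(4, 4)

# axis=0 collapses the rows, giving one value per column
col_sums = arr_2d.sum(axis=0)

# axis=1 collapses the columns, giving one value per row
row_means = arr_2d.mean(axis=1)

# Slices follow [rows, columns]: the last column, then the top-left 2x2 block
last_col = arr_2d[:, -1]
top_left = arr_2d[:2, :2]

print(col_sums)
print(row_means)
print(last_col)
print(top_left)
```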

Fantastic numpy tutorial here: https://www.datacamp.com/community/tutorials/python-numpy-tutorial

## Pandas

From <u>[Mastering Pandas](https://www.packtpub.com/big-data-and-business-intelligence/mastering-pandas)</u>

    The pandas is a high-performance open source library for data analysis in Python developed by Wes McKinney in 2008. Over the years, it has become the de-facto standard library for data analysis using Python. There's been great adoption of the tool, a large community behind it, (220+ contributors and 9000+ commits by 03/2014), rapid iteration, features, and enhancements continuously made.
    
    • It can process a variety of data sets in different formats: time series, tabular heterogeneous, and matrix data.
    • It facilitates loading/importing data from varied sources such as CSV and DB/SQL.
    • It can handle a myriad of operations on data sets: subsetting, slicing, filtering, merging, groupBy, re-ordering, and re-shaping.
    • It can deal with missing data according to rules defined by the user/developer: ignore, convert to 0, and so on.
    • It can be used for parsing and munging (conversion) of data as well as modeling and statistical analysis.
    • It integrates well with other Python libraries such as statsmodels, SciPy, and scikit-learn.
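To make the missing-data bullet above concrete, here is a minimal sketch on a made-up frame. The frame `toy` and its columns `a` and `b` are hypothetical; `isnull`, `dropna`, and `fillna` are standard pandas methods.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

# Count the missing values in each column
missing_per_column = toy.isnull().sum()

# "Ignore": drop every row that has at least one missing value
dropped = toy.dropna()

# "Convert to 0": fill missing values in column a only
filled = toy.fillna({"a": 0})

print(missing_per_column)
print(dropped)
print(filled)
```

Which rule is appropriate depends on the data; filling with 0 changes statistics such as the mean, so it is a choice worth making deliberately.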



In [57]:
#Import pandas library
import pandas as pd

In [68]:
#Turn python dictionary into pandas data frame

data = {"feature_one" :[1,2,4,8,-3],
       "feature_two" : ["haight", "mission", "geary", "castro", "potrero"],
       "feature_three": [True, True, False, True, False]}
df = pd.DataFrame(data)

df

| | feature_one | feature_three | feature_two |
|---|---|---|---|
| 0 | 1 | True | haight |
| 1 | 2 | True | mission |
| 2 | 4 | False | geary |
| 3 | 8 | True | castro |
| 4 | -3 | False | potrero |


In [69]:
#Call type on df
type(df)

pandas.core.frame.DataFrame

In [70]:
#Call feature_one column
f1 = df["feature_one"]
f1

0    1
1    2
2    4
3    8
4   -3
Name: feature_one, dtype: int64

In [71]:
#Call type on f1
type(f1)

pandas.core.series.Series
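The difference between a Series and a DataFrame shows up in how columns are selected. The sketch below rebuilds the same small frame and adds a simple row filter; the result names (`one_col_series`, `one_col_frame`, `two_cols`, `big`) are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "feature_one": [1, 2, 4, 8, -3],
    "feature_two": ["haight", "mission", "geary", "castro", "potrero"],
    "feature_three": [True, True, False, True, False],
})

one_col_series = df["feature_one"]     # a single name gives a Series
one_col_frame = df[["feature_one"]]    # a list of names gives a DataFrame
two_cols = df[["feature_one", "feature_two"]]

# Filtering: a boolean Series inside the brackets keeps only the True rows
big = df[df["feature_one"] > 2]

print(type(one_col_series))
print(type(one_col_frame))
print(two_cols.shape)
print(big)
```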

The first dataset we will work with is the drinks dataset.

In [58]:
#File location of drinks dataset
path = "../data/drinks.csv"

drinks = pd.read_csv(path)

In [59]:
#Take a look at the data

drinks.head()

| | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0 | 0 | 0 | 0.0 | AS |
| 1 | Albania | 89 | 132 | 54 | 4.9 | EU |
| 2 | Algeria | 25 | 0 | 14 | 0.7 | AF |
| 3 | Andorra | 245 | 138 | 312 | 12.4 | EU |
| 4 | Angola | 217 | 57 | 45 | 5.9 | AF |


In [73]:
#Let's designate the country column as the index

drinks.set_index("country", inplace=True)
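Once a column is the index, rows can be looked up by label. The sketch below does not read `drinks.csv`; it types in a few values from the first rows shown above so it runs on its own. The names `sample` and `by_country` are illustrative.

```python
import pandas as pd

# A few values from the first five rows of the drinks data, entered by hand
sample = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "beer_servings": [0, 89, 25, 245, 217],
    "wine_servings": [0, 54, 14, 312, 45],
})

# Without inplace=True, set_index returns a new frame and leaves sample as it was
by_country = sample.set_index("country")

andorra = by_country.loc["Andorra"]                         # one row, by label
albania_beer = by_country.loc["Albania", "beer_servings"]   # one cell, by labels
first_row = by_country.iloc[0]                              # one row, by position

print(andorra)
print(albania_beer)
print(first_row.name)
```

Using `inplace=True`, as the notebook does, modifies the existing frame instead of returning a new one; either style works as long as it is used consistently.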

In [75]:
#Head shows the first 5 rows by default; pass a number to change that.
drinks.head(7)

| country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|
| Afghanistan | 0 | 0 | 0 | 0.0 | AS |
| Albania | 89 | 132 | 54 | 4.9 | EU |
| Algeria | 25 | 0 | 14 | 0.7 | AF |
| Andorra | 245 | 138 | 312 | 12.4 | EU |
| Angola | 217 | 57 | 45 | 5.9 | AF |
| Antigua & Barbuda | 102 | 128 | 45 | 4.9 | NaN |
| Argentina | 193 | 25 | 221 | 8.3 | SA |


In [76]:
#Tail is for last five rows
drinks.tail()

| country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|
| Venezuela | 333 | 100 | 3 | 7.7 | SA |
| Vietnam | 111 | 2 | 1 | 2.0 | AS |
| Yemen | 6 | 0 | 0 | 0.1 | AS |
| Zambia | 32 | 19 | 4 | 2.5 | AF |
| Zimbabwe | 64 | 18 | 4 | 4.7 | AF |


In [77]:
#How many rows and columns are there in this dataset?
drinks.shape

(193, 5)

In [78]:
#Let's look at some details of this dataset
drinks.info()

<class 'pandas.core.frame.DataFrame'>
Index: 193 entries, Afghanistan to Zimbabwe
Data columns (total 5 columns):
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       170 non-null object
dtypes: float64(1), int64(3), object(1)
memory usage: 9.0+ KB


What do you see here?
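One thing the `info()` output shows is that `continent` has only 170 non-null values out of 193 rows. A possible cause, stated here as an assumption about the file rather than a confirmed fact, is that a continent code such as "NA" is being read as a missing value, because `read_csv` treats the string "NA" as missing by default. The sketch below reproduces that behavior on a hypothetical three-row CSV.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; "NA" is meant as a continent code, not a missing value
csv_text = "country,continent\nCanada,NA\nFrance,EU\nJapan,AS\n"

# Default behavior: the string "NA" is converted to a missing value (NaN)
default_parse = pd.read_csv(io.StringIO(csv_text))

# na_filter=False turns off missing-value detection, so "NA" stays a string
literal_parse = pd.read_csv(io.StringIO(csv_text), na_filter=False)

print(default_parse["continent"].isnull().sum())
print(literal_parse["continent"].tolist())
```

Turning off `na_filter` affects every column, so it is only a reasonable choice when the file has no real missing values.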

'''
CLASS: Pandas for Data Exploration, Analysis, and Visualization

WHO alcohol consumption data:
    article: http://fivethirtyeight.com/datalab/dear-mona-followup-where-do-people-drink-the-most-beer-wine-and-spirits/    
    original data: https://github.com/fivethirtyeight/data/tree/master/alcohol-consumption
    files: drinks.csv (with additional 'continent' column)
'''

In [1]:
import pandas as pd
movie_url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.item'
movie_cols = ['movie_id', 'title']
movies = pd.read_table(movie_url, sep='|', header=None, names=movie_cols, usecols=[0, 1])
movies.head()

| | movie_id | title |
|---|---|---|
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | GoldenEye (1995) |
| 2 | 3 | Four Rooms (1995) |
| 3 | 4 | Get Shorty (1995) |
| 4 | 5 | Copycat (1995) |
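The arguments passed when reading the movie file can be tried out without a network connection. This is a minimal sketch on a hypothetical two-line string in the same pipe-delimited layout; the third field and the names `raw` and `movies_small` are made up for illustration. It uses `read_csv`, which accepts the same arguments as `read_table` and differs mainly in its default separator.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited text with no header row and an extra third field
raw = "1|Toy Story (1995)|extra\n2|GoldenEye (1995)|extra\n"

movies_small = pd.read_csv(
    io.StringIO(raw),
    sep="|",                       # fields are separated by pipes, not commas
    header=None,                   # the text has no header line
    names=["movie_id", "title"],   # so the column names are supplied here
    usecols=[0, 1],                # keep only the first two fields
)

print(movies_small)
```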
