# Introduction to NumPy and Pandas

## What is NumPy?

NumPy is a Python package for fast, scalable numerical computation. Its core array object stores data in less memory than an ordinary Python list and offers faster access and modification. It also provides a large number of useful mathematical functions for linear algebra, Fourier transforms, etc.
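
As a small sketch of what the array object offers (the variable names here are made up for illustration), arithmetic applies to every element at once, and common reductions are built in:

```python
import numpy as np

# A plain Python list and the equivalent NumPy array (illustrative values)
values = [1.0, 2.0, 3.0, 4.0]
arr = np.array(values)

# Arithmetic is applied element-wise to the whole array, with no explicit loop
doubled = arr * 2
print(doubled)            # [2. 4. 6. 8.]

# Built-in reductions and linear algebra helpers
print(arr.mean())         # 2.5
print(np.dot(arr, arr))   # 30.0
```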

## What is Pandas?
Pandas is a package that makes data manipulation and analysis in Python easy through its use of two data structures: Series (1D) and DataFrames (2D). Pandas is fast and powerful because it is built on the NumPy arrays described above.

In [1]:
import numpy as np
import pandas as pd

## Pandas Series

A **Series** is a one-dimensional labeled array built on top of a NumPy array. The data in the array can be of any type (integers, strings, dictionaries, etc.), and every element has an index label. In most cases, index labels are strings, integers, or dates. Pandas does not require the labels to be unique, but unique labels keep lookups unambiguous. Series are used to build DataFrames, which we'll talk about very soon.

### Creating a Series
The simplest way to make a Series is from a list, though there are other ways that you can look up if you would like.

In [9]:
list1 = ['apple', 'banana', 'watermelon', 'cantelope']
s1 = pd.Series(list1)
s1

0         apple
1        banana
2    watermelon
3     cantelope
dtype: object

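Notice that the data type of the Series is printed. The data type of this Series is 'object', the general-purpose data type that Pandas uses for Python objects such as the strings here. Numeric, Boolean, and date data get their own data types.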
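
To illustrate how the data type follows the data, here is a minimal sketch with made-up values. It also shows one of the other ways to build a Series: from a dictionary, whose keys become the index labels.

```python
import pandas as pd

# Hypothetical prices: the dictionary keys become the index labels
prices = pd.Series({'apple': 0.5, 'banana': 0.25, 'watermelon': 3.0})
print(prices.dtype)       # float64
print(prices['banana'])   # 0.25

# A list of integers gets an integer data type
counts = pd.Series([3, 1, 2])
print(counts.dtype)       # int64
```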

### Getting the index of a Series

In [3]:
s1.index

RangeIndex(start=0, stop=4, step=1)

### Changing the index of a Series

In [10]:
s1.index = ['one thing','two things','three things','four things']
s1

one thing            apple
two things          banana
three things    watermelon
four things      cantelope
dtype: object
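
Once a Series has custom labels, values can be looked up either by label or by position. A short sketch, using a two-item Series invented for the example:

```python
import pandas as pd

# Illustrative Series with string labels
fruit = pd.Series(['apple', 'banana'], index=['one thing', 'two things'])

# Look up by label
print(fruit['two things'])   # banana

# Look up by integer position with .iloc[]
print(fruit.iloc[0])         # apple
```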

### Applying an operation to every item in a Series

In [14]:
# A Series of strings to manipulate
s2 = pd.Series(['a','bkj','chf','d','ejh','f','gh'])

# Use a lambda (anonymous) function inside the .apply() method to change every string in s2 to upper case
s2 = s2.apply(lambda tmp: tmp.upper()) # apply calls the function on every element
s2

0      A
1    BKJ
2    CHF
3      D
4    EJH
5      F
6     GH
dtype: object
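
For string data specifically, Pandas also offers vectorized string methods through the .str accessor, which can replace a lambda in simple cases. The sketch below (with a shortened, illustrative Series) suggests the two approaches give the same values:

```python
import pandas as pd

letters = pd.Series(['a', 'bkj', 'chf'])

# Element-by-element with .apply() and a lambda
via_apply = letters.apply(lambda text: text.upper())

# The same transformation with the .str accessor
via_str = letters.str.upper()

print(via_apply.tolist())   # ['A', 'BKJ', 'CHF']
print(via_str.tolist())     # ['A', 'BKJ', 'CHF']
```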

### Sorting Values

In [19]:
# A Series of numbers to sort
s3 = pd.Series([5,8,6,2,4,1,3,9,0,7])

# s3 sorted in descending order (returns a new Series; s3 itself is unchanged)
s3.sort_values(ascending = False)

7    9
1    8
9    7
2    6
0    5
4    4
6    3
3    2
5    1
8    0
dtype: int64
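
Notice in the output above that each value keeps its original index label after sorting. A small sketch with a shorter, illustrative Series shows this, along with .sort_index(), which restores the original order:

```python
import pandas as pd

numbers = pd.Series([5, 8, 6, 2])

# sort_values() returns a new Series; the original is left untouched
descending = numbers.sort_values(ascending=False)
print(numbers.tolist())            # [5, 8, 6, 2]
print(descending.tolist())         # [8, 6, 5, 2]

# The index labels travel with their values
print(descending.index.tolist())   # [1, 2, 0, 3]

# Sorting by index puts the values back in their original order
restored = descending.sort_index()
print(restored.tolist())           # [5, 8, 6, 2]
```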

### Getting the counts of distinct values in a Series

In [22]:
# A series of fruit
s4 = pd.Series(['apple','orange','pear','apple','orange','apple'])

# The counts of all of the distinct values in s4
s4.value_counts()

apple     3
orange    2
pear      1
dtype: int64
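
If proportions are more useful than raw counts, .value_counts() accepts normalize=True. A brief sketch using the same fruit values:

```python
import pandas as pd

fruit = pd.Series(['apple', 'orange', 'pear', 'apple', 'orange', 'apple'])

# Fraction of the Series taken up by each distinct value
shares = fruit.value_counts(normalize=True)
print(shares['apple'])    # 0.5

# Number of distinct values
print(fruit.nunique())    # 3
```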

All of the above can be used on columns of a DataFrame, as all of the columns in a DataFrame are Series…

## DataFrames
A DataFrame is a two-dimensional Pandas data structure with labeled rows and columns. Each column of a DataFrame is itself a Series, and all of the columns share the same row index. We'll go over one way of creating a DataFrame, and you'll learn more as you use Pandas.

There are many ways to construct a DataFrame. This image illustrates a few ways that may be useful: ![](pandas-dataframe-shadow.png)

Let's construct a simple DataFrame using a dictionary of Series…

In [27]:
makes = pd.Series(['Ford','Audi','Toyota','Fiat'])
models = pd.Series(['GT','R8','Camry','Panda'])

cars = pd.DataFrame({'Car Make':makes, 'Car Model':models})
cars

  Car Make Car Model
0     Ford        GT
1     Audi        R8
2   Toyota     Camry
3     Fiat     Panda
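
As a sketch of two other common construction routes (shortened to two rows for illustration), a DataFrame can also be built from a dictionary of lists or from a list of dictionaries, one dictionary per row:

```python
import pandas as pd

# From a dictionary of lists: each key becomes a column
from_lists = pd.DataFrame({'Car Make': ['Ford', 'Audi'],
                           'Car Model': ['GT', 'R8']})

# From a list of dictionaries: each dictionary becomes a row
records = [{'Car Make': 'Ford', 'Car Model': 'GT'},
           {'Car Make': 'Audi', 'Car Model': 'R8'}]
from_records = pd.DataFrame(records)

print(from_records.shape)                       # (2, 2)
print(from_lists['Car Make'].tolist())          # ['Ford', 'Audi']
print(from_records['Car Model'].tolist())       # ['GT', 'R8']
```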


### Getting the column names of a DataFrame

In [25]:
cars.columns

Index(['Car Make', 'Car Model'], dtype='object')

### Changing column names

In [30]:
cars.columns = ['Make','Model']
print(cars.columns)
cars

Index(['Make', 'Model'], dtype='object')


     Make  Model
0    Ford     GT
1    Audi     R8
2  Toyota  Camry
3    Fiat  Panda


### Adding columns to a DataFrame

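Notice that making a column from a Series and from a list yields the same outcome. This is because the new column is converted into a Series automatically when it is added to the DataFrame, no matter its original form. One difference to keep in mind: a Series is aligned to the DataFrame by index label, while a list is assigned by position. Here the default indices match, so both behave the same.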

In [37]:
# adding a price column from a Series
cars['Price'] = pd.Series([139995, 164900, 23495, 23490])
# adding a quantity column from a list
cars['Quantity Sold'] = [100, 150, 350, 250] # a plain list works too; it needs one value per row

cars

     Make  Model   Price  Quantity Sold
0    Ford     GT  139995            100
1    Audi     R8  164900            150
2  Toyota  Camry   23495            350
3    Fiat  Panda   23490            250


### Engineering a new column from other columns

Deriving a new column from existing ones is a simple form of feature engineering.

In [38]:
cars['Revenue'] = cars['Price'] * cars['Quantity Sold'] # element-wise: the multiplication is done for every row

cars

     Make  Model   Price  Quantity Sold   Revenue
0    Ford     GT  139995            100  13999500
1    Audi     R8  164900            150  24735000
2  Toyota  Camry   23495            350   8223250
3    Fiat  Panda   23490            250   5872500


### Removing columns

The **.drop()** method will return a DataFrame without the specified rows or columns. 

The **axis** argument indicates whether to drop rows (axis = 0, the default) or columns (axis = 1), and the **inplace** argument indicates whether the method should return a new DataFrame (the default) or edit the original one directly. These two arguments appear in many Pandas methods, so you will likely come across them again.

For this example we will first create a column of NaNs and then delete it…

In [42]:
cars['toDelete'] = np.nan
cars

     Make  Model   Price  Quantity Sold   Revenue  toDelete
0    Ford     GT  139995            100  13999500       NaN
1    Audi     R8  164900            150  24735000       NaN
2  Toyota  Camry   23495            350   8223250       NaN
3    Fiat  Panda   23490            250   5872500       NaN


Now we will delete the 'toDelete' column…

In [43]:
# axis = 1 drops a column; inplace = True modifies cars directly.
# Without inplace = True, drop() returns a new DataFrame and leaves cars unchanged.
cars.drop(['toDelete'], axis = 1, inplace = True)
cars

     Make  Model   Price  Quantity Sold   Revenue
0    Ford     GT  139995            100  13999500
1    Audi     R8  164900            150  24735000
2  Toyota  Camry   23495            350   8223250
3    Fiat  Panda   23490            250   5872500
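
To make the role of **inplace** and **axis** concrete, here is a minimal sketch on a small stand-in DataFrame (df and its contents are invented for the example). Without inplace = True, the original is left alone and the result must be assigned to a variable:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Make': ['Ford', 'Audi'],
                   'toDelete': [np.nan, np.nan]})

# drop() returns a new DataFrame; columns=[...] is equivalent to axis = 1
trimmed = df.drop(columns=['toDelete'])
print(list(trimmed.columns))           # ['Make']
print(list(df.columns))                # ['Make', 'toDelete']

# With the default axis = 0, drop() removes rows by index label
without_first = df.drop([0])
print(without_first.index.tolist())    # [1]
```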


### Setting the index of a DataFrame

You won't always want to keep the default index of your DataFrame, so here we will make the index the 'Make' column from the DataFrame…

In [44]:
cars.set_index('Make', inplace = True) # rows are now labeled by Make instead of 0, 1, 2, 3
cars

        Model   Price  Quantity Sold   Revenue
Make
Ford       GT  139995            100  13999500
Audi       R8  164900            150  24735000
Toyota  Camry   23495            350   8223250
Fiat    Panda   23490            250   5872500
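
Setting the index moves the chosen column out of the regular columns. If you need it back, .reset_index() reverses the operation. A short sketch on a two-row stand-in DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Make': ['Ford', 'Audi'], 'Model': ['GT', 'R8']})

# 'Make' becomes the index and is no longer an ordinary column
indexed = df.set_index('Make')
print(indexed.index.name)        # Make
print(list(indexed.columns))     # ['Model']

# reset_index() turns the index back into a column with a default 0, 1, ... index
restored = indexed.reset_index()
print(list(restored.columns))    # ['Make', 'Model']
```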


## Indexing DataFrames

There are multiple ways to index DataFrames, but first you must learn some syntax…

* With .loc[] and .iloc[], brackets take the form [rows, columns]
* A colon is used for slicing.
    * ex: .iloc[3:,1] would select the fourth through last rows, and the second column (positions start at 0)
    * ex: .iloc[3,:] would select the fourth row, and all of the columns

### Selecting columns from the DataFrame

In [56]:
# Selecting the Model column (a single column comes back as a Series, shown with its index)
print(cars['Model'])

# Selecting the Model and Revenue columns (a list of names returns a DataFrame)
cars[['Model','Revenue']]

Make
Ford         GT
Audi         R8
Toyota    Camry
Fiat      Panda
Name: Model, dtype: object


        Model   Revenue
Make
Ford       GT  13999500
Audi       R8  24735000
Toyota  Camry   8223250
Fiat    Panda   5872500
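
The two outputs above look different because the two selections return different types. A quick sketch on a small illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Model': ['GT', 'R8'],
                   'Revenue': [13999500, 24735000]})

# A single column name returns a Series
print(type(df['Model']).__name__)      # Series

# A list of column names returns a DataFrame, even with one name in the list
print(type(df[['Model']]).__name__)    # DataFrame
```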


### Selecting rows and columns from the DataFrame using .loc and .iloc

**.loc[]** and **.iloc[]** are used for location indexing. **.loc[]** is used to index by row and column labels, while **.iloc[]** is used to index by integer position. These can often be used interchangeably, but there are times when one or the other is more suitable.

In [61]:
cars["Revenue"].apply(lambda x: "$" + str(x) ) # returns a new Series; the result is not stored, so cars is unchanged

# Selecting the Ford index and the Revenue column using .loc[]
print(cars.loc['Ford','Revenue']) 

# Selecting the Ford index and the Revenue column (row position 0, column position 3) using .iloc[]
cars.iloc[0,3] 

13999500


13999500

Notice how if no column is designated between the brackets, all of the columns are selected…

In [58]:
# Selecting the Ford and Toyota indexes and all of the columns using .loc()
print(cars.loc[['Ford','Toyota']])

# Selecting the Ford and Toyota rows (positions 0 and 2) and all of the columns using .iloc[]
cars.iloc[[0,2]]

        Model   Price  Quantity Sold   Revenue
Make                                          
Ford       GT  139995            100  13999500
Toyota  Camry   23495            350   8223250


        Model   Price  Quantity Sold   Revenue
Make
Ford       GT  139995            100  13999500
Toyota  Camry   23495            350   8223250


In [59]:
# Selecting the Audi row using .loc()
print(cars.loc['Audi'])

# Selecting the Audi row using .iloc()
cars.iloc[1]

Model                  R8
Price              164900
Quantity Sold         150
Revenue          24735000
Name: Audi, dtype: object


Model                  R8
Price              164900
Quantity Sold         150
Revenue          24735000
Name: Audi, dtype: object

Notice how when selecting a single row, .loc[] and .iloc[] return the row as a Series.

In [62]:
cars

        Model   Price  Quantity Sold   Revenue
Make
Ford       GT  139995            100  13999500
Audi       R8  164900            150  24735000
Toyota  Camry   23495            350   8223250
Fiat    Panda   23490            250   5872500


In [64]:
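# Selecting the second through last indices, and the first two columns using .loc[]
# (label slices include the end label, so 'Price' is part of the result)
print(cars.loc['Audi':,:'Price'])

# Selecting the second through last indices, and the first two columns using .iloc[]
# (position slices exclude the end position, so :2 means positions 0 and 1)
cars.iloc[1:,:2]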

        Model   Price
Make                 
Audi       R8  164900
Toyota  Camry   23495
Fiat    Panda   23490


        Model   Price
Make
Audi       R8  164900
Toyota  Camry   23495
Fiat    Panda   23490
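
One difference between the two indexers is easy to miss: slices in .loc[] include the end label, while slices in .iloc[] exclude the end position. A minimal sketch on a three-row stand-in DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Model': ['GT', 'R8', 'Camry'],
                   'Price': [139995, 164900, 23495]},
                  index=['Ford', 'Audi', 'Toyota'])

# Label slice: 'Audi' is included
by_label = df.loc['Ford':'Audi']
print(by_label.index.tolist())       # ['Ford', 'Audi']

# Position slice: position 1 is excluded, as with ordinary Python slicing
by_position = df.iloc[0:1]
print(by_position.index.tolist())    # ['Ford']
```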


### Conditional Indexing

Indexing with Boolean conditions: only the rows where the condition is True are kept.

In [67]:
# Selecting all of the rows where Price is greater than $100,000
print(cars.loc[cars['Price'] > 100000])
print()

# Selecting all of the rows where Price is greater than $100,000 and Quantity Sold is less than 125
# (each condition needs its own parentheses when combined with &)
cars.loc[(cars['Price'] > 100000) & (cars['Quantity Sold'] < 125)]

     Model   Price  Quantity Sold   Revenue
Make                                       
Ford    GT  139995            100  13999500
Audi    R8  164900            150  24735000



     Model   Price  Quantity Sold   Revenue
Make
Ford    GT  139995            100  13999500
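
Conditions can also be combined with | (or), negated with ~ (not), or built with .isin(). The sketch below uses a stand-in DataFrame with the same prices and quantities; the variable names are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Price': [139995, 164900, 23495, 23490],
                   'Quantity Sold': [100, 150, 350, 250]},
                  index=['Ford', 'Audi', 'Toyota', 'Fiat'])

# Rows that are cheap OR sold at least 150 units
cheap_or_popular = df[(df['Price'] < 25000) | (df['Quantity Sold'] >= 150)]
print(cheap_or_popular.index.tolist())   # ['Audi', 'Toyota', 'Fiat']

# Rows that do NOT cost more than $100,000
not_expensive = df[~(df['Price'] > 100000)]
print(not_expensive.index.tolist())      # ['Toyota', 'Fiat']

# Rows whose index label appears in a list
selected = df[df.index.isin(['Ford', 'Fiat'])]
print(selected.index.tolist())           # ['Ford', 'Fiat']
```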


## Adding Rows to a DataFrame

New columns are added with plain bracket assignment, but new rows are added with .loc[] by assigning to an index label that does not exist yet.

In [68]:
# adding a row for BMW
cars.loc['BMW'] = ['M5',100000,180,18000000]
cars

        Model   Price  Quantity Sold   Revenue
Make
Ford       GT  139995            100  13999500
Audi       R8  164900            150  24735000
Toyota  Camry   23495            350   8223250
Fiat    Panda   23490            250   5872500
BMW        M5  100000            180  18000000
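
When several rows need to be added at once, one alternative is to put them in their own DataFrame and join the two with pd.concat(). A sketch under that assumption, using a shortened stand-in for the cars table:

```python
import pandas as pd

cars_small = pd.DataFrame({'Model': ['GT', 'R8'],
                           'Price': [139995, 164900]},
                          index=['Ford', 'Audi'])

# The new rows, with the same columns as the existing DataFrame
new_rows = pd.DataFrame({'Model': ['M5'], 'Price': [100000]},
                        index=['BMW'])

# concat() stacks the rows and returns a new DataFrame
combined = pd.concat([cars_small, new_rows])
print(combined.index.tolist())        # ['Ford', 'Audi', 'BMW']
print(combined.loc['BMW', 'Price'])   # 100000
```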


## Importing Data

Pandas has many functions for importing data from different file types such as **pd.read_csv()** and **pd.read_excel()**.

For this example we will use **pd.read_csv()** to read in the iris dataset as a DataFrame…

In [72]:
iris = pd.read_csv('iris.csv', sep = ',') # the 'sep' argument denotes what the items in the file are separated by

iris.head()
#iris.tail()

   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


There are 150 rows of data in this dataset so the **.head()** method was used to print out only the first five rows of data.
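
To see what the sep argument does without needing a file on disk, the sketch below reads a small semicolon-separated text from memory. The text itself is made up for the example; pd.read_csv() accepts file-like objects such as io.StringIO as well as file paths.

```python
import io
import pandas as pd

# Illustrative data separated by semicolons instead of commas
csv_text = ("sepal_length;sepal_width;class\n"
            "5.1;3.5;Iris-setosa\n"
            "4.9;3.0;Iris-setosa\n")

df = pd.read_csv(io.StringIO(csv_text), sep=';')

print(df.shape)                    # (2, 3)
print(list(df.columns))            # ['sepal_length', 'sepal_width', 'class']
print(df['sepal_length'].dtype)    # float64
```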

## Groupby

The **.groupby()** method groups rows that share a value in a categorical column and returns a DataFrame containing an aggregate of your choosing for each of the numeric columns in the DataFrame.

In [73]:
# Grouping iris by the class feature with a mean aggregate
print(iris.groupby('class').mean())

# Grouping iris by the class feature with a sum aggregate
iris.groupby(['class']).sum()

                 sepal_length  sepal_width  petal_length  petal_width
class                                                                
Iris-setosa             5.006        3.418         1.464        0.244
Iris-versicolor         5.936        2.770         4.260        1.326
Iris-virginica          6.588        2.974         5.552        2.026


                 sepal_length  sepal_width  petal_length  petal_width
class
Iris-setosa             250.3        170.9          73.2         12.2
Iris-versicolor         296.8        138.5         213.0         66.3
Iris-virginica          329.4        148.7         277.6        101.3
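
Several aggregates can be computed in one call with .agg(). The following sketch uses a tiny made-up table rather than the iris data so that the results are easy to check by hand:

```python
import pandas as pd

df = pd.DataFrame({'class': ['a', 'a', 'b', 'b'],
                   'petal_length': [1.0, 3.0, 4.0, 6.0]})

# One row per group, one column per aggregate
summary = df.groupby('class')['petal_length'].agg(['mean', 'max', 'count'])

print(summary.loc['a', 'mean'])    # 2.0
print(summary.loc['b', 'max'])     # 6.0
print(summary.loc['b', 'count'])   # 2
```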


## Information about your data

### The .info() method

The **.info()** method gives you a few different pieces of information:

* The number of rows
* The range of the index
* The number of columns
* Each column name with the number of non-null entries in that column and its data type
* The number of columns of each data type
* The amount of memory that the DataFrame takes up

In [74]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length    150 non-null float64
sepal_width     150 non-null float64
petal_length    150 non-null float64
petal_width     150 non-null float64
class           150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB


### The .describe() method

This method will return different information depending on the data types of the columns of the DataFrame being described:

**Categorical information**:

* Count
* Unique
* Top
* Frequency

**Numerical information**:

* Count
* Mean
* Standard Deviation
* Minimum
* 25th Percentile
* 50th Percentile
* 75th Percentile
* Maximum

In [75]:
iris.describe()

       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
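
The output above covers only the numeric columns. To see the categorical summary, one option is to call .describe() on a text column directly. A sketch with a small made-up column:

```python
import pandas as pd

df = pd.DataFrame({'class': ['setosa', 'setosa', 'virginica'],
                   'petal_width': [0.2, 0.2, 2.5]})

# Describing a text column gives count, unique, top, and freq
text_summary = df['class'].describe()

print(text_summary['count'])    # 3
print(text_summary['unique'])   # 2
print(text_summary['top'])      # setosa
print(text_summary['freq'])     # 2
```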


# Exercises

Here is a link to a GitHub repository with lots of good Pandas exercises: https://github.com/guipsamora/pandas_exercises

You should be able to do at least the first few directories in this repository after going through this tutorial, and more with a little searching online.