# Introduction to NumPy & Pandas

# NumPy
NumPy is a Python library for working with large, multi-dimensional arrays. It also provides a large collection of mathematical functions that operate on these arrays. Some uses of NumPy include:
1. Scientific domains of statistical computing, image processing, and mathematical analysis
2. Data Science for ETL, Exploratory Data Analysis, and Modeling
3. Machine Learning
4. Data Visualization

Let's import numpy as np.

In [38]:
import numpy as np

## NumPy Arrays
Let's create an array using the function `np.array`, passing three values in square brackets. NumPy arrays may look similar to the Python lists that we discussed in the last chapter. The difference is that a NumPy array stores values of a single type in a compact block of memory, so operations on it are typically faster and more memory efficient. That makes it much easier to process the large data sets that are commonly used in machine learning models.

In [39]:
np.array([0,1,2])

array([0, 1, 2])
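As a small illustration of how arrays differ from lists in practice, the sketch below doubles every value in both. The names `values_list` and `values_array` are hypothetical and chosen only for this example.

```python
import numpy as np

# Hypothetical example data, not part of the lesson's own cells
values_list = [0, 1, 2]
values_array = np.array(values_list)

# A list needs an explicit loop (here a list comprehension)
doubled_list = [v * 2 for v in values_list]

# An array applies the arithmetic to every element at once
doubled_array = values_array * 2

print(doubled_list)   # [0, 2, 4]
print(doubled_array)  # [0 2 4]
```

Element-wise operations like this are where the speed advantage of arrays tends to show up on large data sets.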

With the array function, we can also create a matrix of values. Let's create an Array_1 with two groups of numbers, each written within its own pair of square brackets. When we display Array_1, the two groups appear as two rows.

In [40]:
# Creating a matrix
Array_1 = np.array([[1,2,3],[4,5,6]])
Array_1

array([[1, 2, 3],
       [4, 5, 6]])

We can store other types of data in an array as well. For example, let's create an Array_2 with four numbers in one group, and a mix of a string and numbers in another group.

The result shows that a single string value turns every value in the array into a string, even though the numbers were written without quotation marks. This is different from Python lists, where one list can hold different types of data.

In [41]:
# NumPy arrays can also hold strings.
Array_2 = np.array([[1,2,3,4],["hello world",5,6,7]])
Array_2

array([['1', '2', '3', '4'],
       ['hello world', '5', '6', '7']], dtype='<U11')
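To confirm this conversion, we can inspect the `dtype` of an array. The sketch below is an illustration only: `numbers_only`, `mixed`, and `mixed_objects` are hypothetical names, and passing `dtype=object` is one possible way to keep the original Python values if that is really needed.

```python
import numpy as np

# Hypothetical arrays used only for this illustration
numbers_only = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
mixed = np.array([[1, 2, 3, 4], ["hello world", 5, 6, 7]])

# dtype.kind is a one-letter code for the general type of the elements
print(numbers_only.dtype.kind)  # i  (integer)
print(mixed.dtype.kind)         # U  (Unicode string)

# Assumption: dtype=object asks NumPy to keep each value as a Python object
mixed_objects = np.array([[1, 2, 3, 4], ["hello world", 5, 6, 7]], dtype=object)
print(type(mixed_objects[0, 0]))  # <class 'int'>
```

Object arrays give up much of the speed and memory benefit of NumPy, so they are usually a last resort.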

## Operations with NumPy Arrays
Once we create a NumPy array, we can use its `shape` attribute to see the dimensions of the array. This attribute matters because, as we start working with bigger and more complex data sets, we often need to know their shape. Let's look at the shape of Array_1 that we created earlier. The result tells us this array has 2 rows and 3 columns.

In [42]:
Array_1

array([[1, 2, 3],
       [4, 5, 6]])

In [44]:
Array_1.shape

(2, 3)
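A few related attributes are often checked alongside `shape`. The sketch below rebuilds Array_1 so that it runs on its own; the variable `reshaped` is a hypothetical name introduced for the example.

```python
import numpy as np

Array_1 = np.array([[1, 2, 3], [4, 5, 6]])

print(Array_1.shape)  # (2, 3) -> 2 rows, 3 columns
print(Array_1.ndim)   # 2      -> number of dimensions
print(Array_1.size)   # 6      -> total number of elements

# reshape returns the same six values arranged as 3 rows and 2 columns
reshaped = Array_1.reshape(3, 2)
print(reshaped.shape)  # (3, 2)
```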

To access a particular cell within an array, we use the row and column index to identify the location of the value. Indexing starts at 0, so the item in the second row and first column is at index [1, 0], which gives us the value 4.

In [27]:
# Calling a particular cell within array [row, column]
Array_1

array([[1, 2, 3],
       [4, 5, 6]])

In [28]:
# Retrieve the first item (index 0) from the second row (index 1)
Array_1[1,0]

4
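The same square-bracket syntax can also return whole rows or columns. This is a minimal sketch; the names `second_row`, `first_column`, and `last_item` are hypothetical.

```python
import numpy as np

Array_1 = np.array([[1, 2, 3], [4, 5, 6]])

# A single index selects an entire row
second_row = Array_1[1]
print(second_row)    # [4 5 6]

# A colon means "every row"; the 0 then picks the first column
first_column = Array_1[:, 0]
print(first_column)  # [1 4]

# Negative indexes count from the end
last_item = Array_1[-1, -1]
print(last_item)     # 6
```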

Now let's look at how to find the minimum or maximum value of an array. Still using Array_1 as an example, we can call the `min()` and `max()` methods to find the smallest and the largest values in the array.

In [7]:
print(Array_1.min())
print(Array_1.max())

1
6
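Both methods also accept an `axis` argument when we want one result per column or per row instead of a single value. The sketch below assumes the same Array_1; `column_minimums` and `row_maximums` are hypothetical names.

```python
import numpy as np

Array_1 = np.array([[1, 2, 3], [4, 5, 6]])

# axis=0 works down the rows, giving one value per column
column_minimums = Array_1.min(axis=0)
print(column_minimums)  # [1 2 3]

# axis=1 works across the columns, giving one value per row
row_maximums = Array_1.max(axis=1)
print(row_maximums)     # [3 6]
```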


# Pandas

The second Python package that we will explore is Pandas. Pandas is an open source package, built on top of NumPy, that is widely used for data science and analysis. It offers powerful data structures that help analyze and manipulate data.

Pandas automates tasks that are time consuming and repetitive. Some uses of Pandas include:

 1. Data cleaning - systematically clean dirty data
 2. Loading and saving data - Easily import data from an external source and export to your local computer
 3. Filling data - systematically fill in data
 4. Joining data - Merge datasets together
 5. Statistical analysis - Run statistical analysis on datasets easily

In [8]:
import pandas as pd

## Pandas Series
The first type of data structure in Pandas is called a Series. A Series is a one-dimensional array with index labels. It can also hold different types of data. Series can be made from lists, dictionaries, and NumPy arrays.

Series are useful for organizing simple data so that it can be quickly digested.

Consider the code below for different types of data that will be used to create a series.

In [9]:
markers = ['a','b','c'] 
list_1 = [12,24,36]
array_1 = np.array([15,30,45])
dict_1 = {'d':20,'e':40,'f':60}

We will first turn `list_1` into a series. Please note that Python is case sensitive, so `pd.Series` needs to be written with a capital S.

In [10]:
#Applying list into a series
pd.Series(data = list_1)

0    12
1    24
2    36
dtype: int64

We see that pandas automatically assigned index labels (0, 1, 2) to the values when the series was created. We can change the index of a series by passing the `markers` list that we defined above as the `index` argument.

In [11]:
#Changing the index of a series
pd.Series(data = list_1, index = markers)

a    12
b    24
c    36
dtype: int64

We can make a series the same way with an array and a dictionary. Using a dictionary is slightly different, as its keys become the index labels. The data is passed in the same way, but there is no need to supply a separate index.

In [12]:
#Apply array into a series
pd.Series(data = array_1, index = markers)

a    15
b    30
c    45
dtype: int32

In [13]:
# Apply dictionary into a series
pd.Series(dict_1)

d    20
e    40
f    60
dtype: int64

We can see that pandas shows the data type of the series, which in this case is an integer. A series can hold not just integers, but many other data objects.

In [14]:
a = "Mike"
b = True
c = sum
pd.Series([a,b,c]) 

0                       Mike
1                       True
2    <built-in function sum>
dtype: object

## Operations with Pandas Series
Similar to a dictionary, you can use the index of a series to easily look up values. Consider the following gift shop data. If we want to see the sales of magnets, we just need to type the series name and put the index label 'Magnets' in square brackets.

In [15]:
Gift_shop_salesQ1 = pd.Series(data=[300,550,240,180],index = ['Magnets', 
                                                              'Coasters',
                                                              'Handbags', 
                                                              'Snacks'])                  
Gift_shop_salesQ1

Magnets     300
Coasters    550
Handbags    240
Snacks      180
dtype: int64

In [33]:
#Gather a value by the index
Gift_shop_salesQ1['Magnets']

300
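Besides plain square brackets, a series can be queried by label, by position, or by a condition. The sketch below recreates the Q1 series so that it runs on its own; `by_label`, `by_position`, and `over_250` are hypothetical names.

```python
import pandas as pd

Gift_shop_salesQ1 = pd.Series(
    data=[300, 550, 240, 180],
    index=['Magnets', 'Coasters', 'Handbags', 'Snacks'],
)

# .loc looks a value up by its index label
by_label = Gift_shop_salesQ1.loc['Coasters']
print(by_label)     # 550

# .iloc looks a value up by its position, starting at 0
by_position = Gift_shop_salesQ1.iloc[0]
print(by_position)  # 300

# A condition inside the brackets keeps only the matching rows
over_250 = Gift_shop_salesQ1[Gift_shop_salesQ1 > 250]
print(over_250)
```

Here `over_250` keeps Magnets and Coasters, the two items with sales above 250.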

One powerful use of a series is performing series operations. Consider the following Q2 data that has different revenue numbers. We can add the two series (Gift_shop_salesQ1, Gift_shop_salesQ2) together, as they have the same index, to find the combined revenue for the first half of the year.

In [17]:
Gift_shop_salesQ2 = pd.Series([340,600,225,75],index = ['Magnets', 'Coasters',
                                                        'Handbags', 'Snacks'])  
Gift_shop_salesQ2

Magnets     340
Coasters    600
Handbags    225
Snacks       75
dtype: int64

In [18]:
Gift_shop_salesQ1 + Gift_shop_salesQ2

Magnets      640
Coasters    1150
Handbags     465
Snacks       255
dtype: int64

However, when you combine series that have different indices, only the index labels shared by all of the series return a value; the rest are null (NaN). `Gift_shop_salesQ3` introduces a new label, Postcards, and lacks Snacks, so these two labels, which are not present in all three series, are left as null.

In [19]:
Gift_shop_salesQ3 = pd.Series([480,520,360,40],index = ['Magnets', 'Coasters',
                                                        'Handbags', 'Postcards'])  
Gift_shop_salesQ1 + Gift_shop_salesQ2 + Gift_shop_salesQ3

Coasters     1670.0
Handbags      825.0
Magnets      1120.0
Postcards       NaN
Snacks          NaN
dtype: float64
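If a missing label should count as zero rather than produce a null, one option is the `add` method with `fill_value=0`. This is a sketch of that alternative, not the approach used in the lesson; `total_sales` is a hypothetical name.

```python
import pandas as pd

Gift_shop_salesQ1 = pd.Series([300, 550, 240, 180],
                              index=['Magnets', 'Coasters', 'Handbags', 'Snacks'])
Gift_shop_salesQ2 = pd.Series([340, 600, 225, 75],
                              index=['Magnets', 'Coasters', 'Handbags', 'Snacks'])
Gift_shop_salesQ3 = pd.Series([480, 520, 360, 40],
                              index=['Magnets', 'Coasters', 'Handbags', 'Postcards'])

# fill_value=0 treats a label missing from one series as 0 instead of NaN
total_sales = (Gift_shop_salesQ1
               .add(Gift_shop_salesQ2, fill_value=0)
               .add(Gift_shop_salesQ3, fill_value=0))
print(total_sales)
```

With this approach Snacks keeps its Q1 + Q2 total of 255 and Postcards shows its Q3 value of 40, instead of both being NaN.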

## Pandas DataFrame

By far the most frequently used pandas data structure is the DataFrame. A DataFrame is a 2-dimensional data structure that contains rows and columns of data, similar to an Excel table.

DataFrames allow us to leverage the power of pandas in the ways mentioned earlier in the lesson, such as cleaning data, manipulating data, and performing statistical analysis.

There are multiple ways of creating a DataFrame. The first method that we will look at is creating it with an array.

In [35]:
Frames = np.array([[1,2,3],[4,5,6],[4,5,6],[4,5,6],[4,5,6]])
Frames

array([[1, 2, 3],
       [4, 5, 6],
       [4, 5, 6],
       [4, 5, 6],
       [4, 5, 6]])

There are no labels on the rows and columns of this NumPy array. To prepare for the DataFrame, we'll create a list called `Rows` with five characters to label the five rows, and another list called `Columns` with three characters to label the three columns.

We will now generate a DataFrame called dataframe_1, which will contain the information from the `Frames` array as well as the row and column names.

In [36]:
Rows= ['A', 'B', 'C', 'D', 'E']
Columns= ['X', 'Y', 'Z']

dataframe_1 = pd.DataFrame(Frames, index=Rows, columns=Columns)
dataframe_1

   X  Y  Z
A  1  2  3
B  4  5  6
C  4  5  6
D  4  5  6
E  4  5  6
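Once the rows and columns are labeled, those labels can be used to pull data back out. The sketch below rebuilds dataframe_1 so that it runs on its own; `column_x`, `row_b`, and `cell_a_z` are hypothetical names.

```python
import numpy as np
import pandas as pd

Frames = np.array([[1, 2, 3], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6]])
dataframe_1 = pd.DataFrame(Frames,
                           index=['A', 'B', 'C', 'D', 'E'],
                           columns=['X', 'Y', 'Z'])

# Square brackets with a column label return that column as a Series
column_x = dataframe_1['X']
print(column_x.tolist())  # [1, 4, 4, 4, 4]

# .loc with a row label returns that row as a Series
row_b = dataframe_1.loc['B']
print(row_b.tolist())     # [4, 5, 6]

# .loc with a row label and a column label returns a single cell
cell_a_z = dataframe_1.loc['A', 'Z']
print(cell_a_z)           # 3
```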


Similar to a pandas series, DataFrames can store various types of data, such as integers, strings, lists, etc.

We will now create a DataFrame with a dictionary. 

In [37]:
dictionary_2 = {'Grade':['A','A','B','B','C','C'],
                'Price':[125,236,300,300,472,600],
                'State':['NY','CA','IL','IL','WI','NV']}

dataframe_2 = pd.DataFrame(data=dictionary_2)
dataframe_2

  Grade  Price State
0     A    125    NY
1     A    236    CA
2     B    300    IL
3     B    300    IL
4     C    472    WI
5     C    600    NV
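As a brief taste of the analysis tasks mentioned earlier, the sketch below filters dataframe_2 and computes an average price per grade. It is an illustration under the lesson's sample data; `grade_b` and `mean_price_by_grade` are hypothetical names.

```python
import pandas as pd

dictionary_2 = {'Grade': ['A', 'A', 'B', 'B', 'C', 'C'],
                'Price': [125, 236, 300, 300, 472, 600],
                'State': ['NY', 'CA', 'IL', 'IL', 'WI', 'NV']}
dataframe_2 = pd.DataFrame(data=dictionary_2)

# Each column has its own data type
print(dataframe_2.dtypes)

# Keep only the rows where Grade is 'B'
grade_b = dataframe_2[dataframe_2['Grade'] == 'B']
print(len(grade_b))  # 2

# Average price for each grade
mean_price_by_grade = dataframe_2.groupby('Grade')['Price'].mean()
print(mean_price_by_grade)
```

The averages come out to 180.5 for grade A, 300.0 for grade B, and 536.0 for grade C.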
