# ACM AI Beginner Track
## What are Numpy and Pandas?
Numpy is a numerical computing library for Python. It provides an n-dimensional array object along with fast operations on those arrays, including linear algebra routines. Numpy is especially important because Pandas is built on top of it.

Pandas is a powerful Python library for manipulating and analyzing tabular data. Its core tools, Series and DataFrames, are covered in this reference document.
## Getting started
We'll be using Google Colab for development. Conveniently, the packages we need come pre-installed with Colab. If you would rather develop locally, you can download the Anaconda distribution, which also ships with these packages and more.



Before using each package you need to import it.

In [2]:
import numpy as np       
import pandas as pd

## Numpy Arrays
Numpy arrays are the core object of this package. An array can have any number of dimensions; the two kinds used here are vectors (1-dimensional arrays) and matrices (2-dimensional arrays).

### Creating Numpy Arrays through Python lists

In [3]:
# Avoid naming a variable "list": that would shadow Python's built-in list type.
my_list = [1,2,3]
my_list

[1, 2, 3]

In [4]:
np.array(my_list)

array([1, 2, 3])

In [5]:
another_list = [[1,2,3],[4,5,6], [7,8,9]]
another_list

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

In [6]:
np.array(another_list)

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

### Creating Numpy Arrays with built-in methods

#### arange
Creates an array of evenly spaced values within the half-open interval [start, stop), using a given step size.

In [7]:
np.arange(0, 5, 2)

array([0, 2, 4])

In [8]:
np.arange(100, 1000, 50)

array([100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700,
       750, 800, 850, 900, 950])

#### zeros
Creates an array of zeroes.

In [9]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

In [10]:
np.zeros((10,10))

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

#### ones
Creates an array of ones.

In [11]:
np.ones(5)

array([1., 1., 1., 1., 1.])

In [12]:
np.ones((10,10))

array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

#### linspace
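Creates an array with a given number of evenly spaced values over an interval. Unlike arange, the third argument is the number of values rather than the step size, and the endpoint is included.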

In [13]:
np.linspace(0, 10, 5)

array([ 0. ,  2.5,  5. ,  7.5, 10. ])

In [14]:
np.linspace(0, 100, 10)

array([  0.        ,  11.11111111,  22.22222222,  33.33333333,
        44.44444444,  55.55555556,  66.66666667,  77.77777778,
        88.88888889, 100.        ])
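Since arange and linspace are easy to mix up, here is a small side-by-side sketch. The variable names are made up for this illustration.

```python
import numpy as np

# arange takes a step size; the stop value (10) is excluded.
stepped = np.arange(0, 10, 2.5)

# linspace takes a number of points; the stop value (10) is included by default.
spaced = np.linspace(0, 10, 5)

print(stepped)  # values: 0, 2.5, 5, 7.5
print(spaced)   # values: 0, 2.5, 5, 7.5, 10
```

A reasonable rule of thumb: reach for arange when you know the step, and linspace when you know how many points you want.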

#### eye
Creates an identity matrix.

In [15]:
np.eye(5)

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

#### rand
Creates an array of the given shape consisting of random values from a uniform distribution over [0, 1).

In [16]:
np.random.rand(5)

array([0.9784522 , 0.26068911, 0.827741  , 0.25526763, 0.5312146 ])

In [17]:
np.random.rand(5,5)

array([[0.9421108 , 0.04248036, 0.84382043, 0.19632445, 0.2345305 ],
       [0.31211313, 0.07948539, 0.12713529, 0.36660955, 0.27874199],
       [0.43124798, 0.49389983, 0.03064317, 0.85769738, 0.24761717],
       [0.87058501, 0.9010601 , 0.69966227, 0.45603046, 0.67516109],
       [0.14431467, 0.29839072, 0.06397708, 0.34145666, 0.0907616 ]])

#### randn
Creates an array of the given shape consisting of random values from the standard normal distribution.

In [18]:
np.random.randn(10)

array([-0.78353947, -0.65745364,  1.30579031, -1.18272965,  0.34299178,
       -0.27457191, -0.49104369, -1.1347884 ,  0.60900134, -1.46934418])

In [19]:
np.random.randn(2,2)

array([[-1.17326122, -0.21903218],
       [-0.42959099, -1.00440518]])

#### randint
Creates an array of random integers from the half-open interval [low, high). The optional third argument sets how many values to return.

In [20]:
np.random.randint(1, 10)

6

In [21]:
np.random.randint(1, 10, 3)

array([9, 8, 3])

### Array Attributes and Methods

In [22]:
array = np.arange(10)
another_array = np.random.randint(0, 100, 25)

In [23]:
array

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [24]:
another_array

array([51, 12, 99, 68, 44, 89, 29, 83, 19, 23, 17, 82, 39, 93, 43, 58, 58,
       99, 60, 28, 45, 28, 59, 78, 57])

#### Reshape

Returns an array with the same data in a new shape. The new shape must contain the same number of elements as the original.

In [25]:
another_array.reshape(5,5)

array([[51, 12, 99, 68, 44],
       [89, 29, 83, 19, 23],
       [17, 82, 39, 93, 43],
       [58, 58, 99, 60, 28],
       [45, 28, 59, 78, 57]])
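As a small sketch of two reshape details that tend to come up quickly: passing -1 lets Numpy infer one dimension, and a shape with the wrong number of elements raises an error. The array below is made up for this illustration.

```python
import numpy as np

values = np.arange(12)

# -1 asks Numpy to infer that dimension from the total number of elements.
three_rows = values.reshape(3, -1)
print(three_rows.shape)  # (3, 4)

# A shape whose element count does not match (25 vs. 12) raises a ValueError.
try:
    values.reshape(5, 5)
except ValueError as error:
    print("reshape failed:", error)
```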

#### max, min, argmax, argmin

In [26]:
# Gets the maximum value.
another_array.max()

99

In [27]:
# Gets the position of the maximum value.
another_array.argmax()

2

In [28]:
# Gets the minimum value.
another_array.min()

12

In [29]:
# Gets the position of the minimum value.
another_array.argmin()

1
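These methods also work on 2D arrays. The sketch below, using a small made-up matrix, shows the axis parameter and how to turn the flat position returned by argmax back into a row and column.

```python
import numpy as np

matrix = np.array([[51, 12, 99],
                   [89, 29, 83]])

print(matrix.max(axis=0))  # maximum of each column: 89, 29, 99
print(matrix.max(axis=1))  # maximum of each row: 99, 89

# On a 2D array, argmax returns a position in the flattened array...
flat_position = matrix.argmax()
# ...which unravel_index converts back to (row, column).
row, column = np.unravel_index(flat_position, matrix.shape)
print(row, column)  # 0 2
```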

#### Shape

This is an attribute that all arrays have, which holds the shape of the array.

In [30]:
another_array.reshape(5,5).shape

(5, 5)

In [31]:
reshaped_array = array.reshape(10, 1)
reshaped_array

array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])

In [32]:
reshaped_array.shape

(10, 1)

#### dtype

Returns the data type of the data in the array.

In [33]:
array.dtype

dtype('int64')

## Numpy Indexing and Selection

### Bracket Indexing and Selection

Works similarly to list indexing in Python.

In [34]:
array[1]

1

In [35]:
reshaped_array[1]

array([1])

### Broadcasting

In [36]:
array

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [37]:
# Setting the first 5 values to 10.
array[0:5] = 10
array

array([10, 10, 10, 10, 10,  5,  6,  7,  8,  9])

In [38]:
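# Getting a "slice" of the array. In this case, all values from position 5 till the end.
# The name array_slice is used because "slice" would shadow a Python built-in.
array_slice = array[5:]
array_slice

array([5, 6, 7, 8, 9])

In [39]:
# Changing the slice we got above.
array_slice[:] = 20
array_slice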

array([20, 20, 20, 20, 20])

In [40]:
# Our original array is changed!
array

array([10, 10, 10, 10, 10, 20, 20, 20, 20, 20])

In [41]:
# This is because a slice is a view of the original data, not a copy. Use copy() to get an independent array.
array_copy = array.copy()

In [42]:
# Changing the copy will not change the original now.
array_copy[:] = 100

In [43]:
array_copy

array([100, 100, 100, 100, 100, 100, 100, 100, 100, 100])

In [44]:
array

array([10, 10, 10, 10, 10, 20, 20, 20, 20, 20])
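If you are ever unsure whether two arrays share data, np.shares_memory can check. This is a minimal sketch with made-up variable names.

```python
import numpy as np

original = np.arange(10)
view = original[5:]                 # basic slicing returns a view
independent = original[5:].copy()   # copy() allocates new memory

print(np.shares_memory(original, view))         # True
print(np.shares_memory(original, independent))  # False

# Writing through the view changes the original; the copy is unaffected.
view[:] = 0
print(original)
print(independent)
```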

### Indexing 2D Arrays (Matrices)

In [45]:
new_array = np.random.randint(0, 100, 16)
new_array

array([45, 32, 94, 82, 28, 39, 49, 97, 36, 42, 15, 13, 54, 80, 88, 76])

In [46]:
new_array = new_array.reshape(4,4)
new_array

array([[45, 32, 94, 82],
       [28, 39, 49, 97],
       [36, 42, 15, 13],
       [54, 80, 88, 76]])

In [47]:
# Getting the value at row x and column y: new_array[x,y] or new_array[x][y]
new_array[1,3]

97

In [48]:
# Array slicing. Below example gets first 3 rows and last 2 columns.
new_array[0:3, 2:]

array([[94, 82],
       [49, 97],
       [15, 13]])

In [49]:
# Slicing top left corner of array.
new_array[:2,:2]

array([[45, 32],
       [28, 39]])

### Selection

Selecting part of an array based on comparison operators.

In [50]:
new_array

array([[45, 32, 94, 82],
       [28, 39, 49, 97],
       [36, 42, 15, 13],
       [54, 80, 88, 76]])

In [51]:
new_array > 50

array([[False, False,  True,  True],
       [False, False, False,  True],
       [False, False, False, False],
       [ True,  True,  True,  True]])

In [52]:
# Read: "From new_array, give me an array where the values of new_array are greater than 50."
new_array[new_array > 50]

array([94, 82, 97, 54, 80, 88, 76])
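Conditions can also be combined. The sketch below rebuilds the same values as new_array under the hypothetical name grid, and shows np.where as one way to keep the original shape.

```python
import numpy as np

# Same values as new_array in the example above.
grid = np.array([[45, 32, 94, 82],
                 [28, 39, 49, 97],
                 [36, 42, 15, 13],
                 [54, 80, 88, 76]])

# Use & (and), | (or), ~ (not), and wrap each comparison in parentheses.
between = grid[(grid > 30) & (grid < 50)]
print(between)  # values: 45, 32, 39, 49, 36, 42

# np.where keeps the 4x4 shape and puts 0 wherever the condition fails.
capped = np.where(grid > 50, grid, 0)
print(capped)
```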

## Numpy Operations

### Array Arithmetic

In [53]:
array_x = np.random.randint(0, 10, 9)
array_x = array_x.reshape(3,3)
array_x

array([[2, 7, 3],
       [1, 3, 7],
       [2, 1, 0]])

In [54]:
array_y = np.random.randint(0, 10, 9)
array_y

array([3, 0, 6, 5, 6, 5, 3, 3, 0])

In [55]:
array_x + array_x

array([[ 4, 14,  6],
       [ 2,  6, 14],
       [ 4,  2,  0]])

In [56]:
array_y * array_y

array([ 9,  0, 36, 25, 36, 25,  9,  9,  0])

In [57]:
# Emits a RuntimeWarning (not an error): the last element computes 0/0, which gives nan.
array_x/array_x

  


array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1., nan]])
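If you would rather avoid the warning and the nan, one possible approach is to divide only where the denominator is non-zero. This is a sketch, not the only way to do it; filling the skipped positions with 0 is an assumption made for this illustration.

```python
import numpy as np

# Same values as array_x above, stored as floats.
numerator = np.array([[2, 7, 3], [1, 3, 7], [2, 1, 0]], dtype=float)
denominator = numerator.copy()

# Divide only where the denominator is non-zero.
# Skipped positions keep the value already stored in `out` (0 here).
safe = np.divide(numerator, denominator,
                 out=np.zeros_like(numerator),
                 where=denominator != 0)
print(safe)
```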

In [58]:
array_x ** 2

array([[ 4, 49,  9],
       [ 1,  9, 49],
       [ 4,  1,  0]])

### Universal Array Functions

You can apply mathematical functions to every element of an array at once. These are called universal functions (ufuncs); see the Numpy documentation for the full list.

In [59]:
np.sqrt(array_x)

array([[1.41421356, 2.64575131, 1.73205081],
       [1.        , 1.73205081, 2.64575131],
       [1.41421356, 1.        , 0.        ]])

In [60]:
np.exp(array_x)

array([[7.38905610e+00, 1.09663316e+03, 2.00855369e+01],
       [2.71828183e+00, 2.00855369e+01, 1.09663316e+03],
       [7.38905610e+00, 2.71828183e+00, 1.00000000e+00]])

In [61]:
np.sin(array_x)

array([[0.90929743, 0.6569866 , 0.14112001],
       [0.84147098, 0.14112001, 0.6569866 ],
       [0.90929743, 0.84147098, 0.        ]])

## Pandas Series
Pandas Series are quite similar to 1-dimensional Numpy arrays. The main difference is that a Series has axis labels, so it can be indexed by a label rather than only by position. A Series can also hold arbitrary Python objects, not just numbers.

### Creating Pandas Series

In [62]:
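# Names such as "list" and "dict" are avoided because they would shadow Python built-ins.
values = [10, 20, 30]
labels = ['a', 'b', 'c']
array = np.arange(10, 40, 10)
mapping = {'a': 10, 'b': 20, 'c': 30}

In [63]:
pd.Series(data=values)

0    10
1    20
2    30
dtype: int64

In [64]:
pd.Series(data=values, index=labels)

a    10
b    20
c    30
dtype: int64

In [65]:
pd.Series(data=array, index=labels)

a    10
b    20
c    30
dtype: int64

In [66]:
# A dictionary's keys become the index.
pd.Series(mapping)

a    10
b    20
c    30
dtype: int64

In [67]:
# The positional arguments are (data, index), so here the labels are the data.
pd.Series(labels, values)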

10    a
20    b
30    c
dtype: object

### Using the Index
You can look up data in a Series by its index label. Arithmetic between two Series is also aligned on the index.

In [68]:
series_a = pd.Series(data=[1,2,3,4], index=['a','b','c','d'])
series_a

a    1
b    2
c    3
d    4
dtype: int64

In [69]:
series_b = pd.Series(data=[10,20,30,40], index=['b','c','d','e'])
series_b

b    10
c    20
d    30
e    40
dtype: int64

In [70]:
series_a['a']

1

In [71]:
series_b['e']

40

In [72]:
# The 'a' and 'e' entries are NaN because operations are aligned on the index, and each of these
# labels appears in only one of the two series.
series_a + series_b

a     NaN
b    12.0
c    23.0
d    34.0
e     NaN
dtype: float64
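If NaN is not what you want for labels that appear in only one Series, the add method accepts a fill_value. A minimal sketch using the same two Series; treating a missing entry as 0 is an assumption made for this example.

```python
import pandas as pd

series_a = pd.Series(data=[1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
series_b = pd.Series(data=[10, 20, 30, 40], index=['b', 'c', 'd', 'e'])

# fill_value=0 substitutes 0 for a label that is missing from one of the two Series.
total = series_a.add(series_b, fill_value=0)
print(total)  # a 1.0, b 12.0, c 23.0, d 34.0, e 40.0
```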

## Pandas DataFrames
This is the core of Pandas. Most of the data manipulation you will do involves messing around with DataFrames. Think of them as multiple Series put together that also share the same index.

### Creating DataFrames
More often than not, you will create DataFrames by importing data from CSV files or other file types. We will go over this later, but for now we'll learn how to make them manually.

In [96]:
# To create a basic DataFrame, you must pass in the data (which can be a numpy array) and lists 
# for the index and column names.
array = np.random.randint(0,100,25).reshape(5,5)
df = pd.DataFrame(data=array, index=['A', 'B', 'C', 'D', 'E'], columns=['V','W', 'X', 'Y', 'Z'])
df

    V   W   X   Y   Z
A  73  81  63   1  28
B   1  78  26  44  31
C  72  96   6  66   9
D  25   5  72  97  57
E  29  25  32  30  68


### Selection and Indexing

In [97]:
# You can get columns like this...
df['Z']

A    28
B    31
C     9
D    57
E    68
Name: Z, dtype: int64

In [98]:
# And rows like this...
df.loc['E']
# You can also select rows by position with df.iloc. Here, df.iloc[4] returns the same row.

V    29
W    25
X    32
Y    30
Z    68
Name: E, dtype: int64
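To make the difference between label-based and position-based selection concrete, here is a small sketch on a made-up DataFrame (a cut-down version of the one above).

```python
import pandas as pd

df = pd.DataFrame({'V': [73, 1, 72], 'Z': [28, 31, 9]}, index=['A', 'B', 'C'])

by_label = df.loc['C']      # row whose index label is 'C'
by_position = df.iloc[2]    # third row, counting from 0
print(by_label.equals(by_position))  # True

# Both accept a column selector as the second argument.
print(df.loc['B', 'Z'])  # 31
print(df.iloc[1, 1])     # 31
```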

In [99]:
# Both rows and columns are Series.
type(df.loc['E'])

pandas.core.series.Series

In [100]:
# You can create and add new columns.
df['new'] = pd.Series(data=np.random.randint(0, 100, 5), index=['A', 'B', 'C', 'D', 'E'])
df

    V   W   X   Y   Z  new
A  73  81  63   1  28   64
B   1  78  26  44  31   15
C  72  96   6  66   9   38
D  25   5  72  97  57   81
E  29  25  32  30  68   11


In [101]:
# You can also remove rows and columns. The axis parameter denotes whether you want to remove a row
# (axis=0, the default) or a column (axis=1). The inplace parameter denotes whether the original DataFrame
# is modified (if left as False, the function returns a new DataFrame without changing the original).
df.drop('new', axis=1, inplace=True)
df

    V   W   X   Y   Z
A  73  81  63   1  28
B   1  78  26  44  31
C  72  96   6  66   9
D  25   5  72  97  57
E  29  25  32  30  68


In [102]:
# You can select subsets of rows and columns.
df.loc['A', 'V']

73

In [103]:
df.loc[['B', 'D'], ['X', 'Y']]

    X   Y
B  26  44
D  72  97


### Conditional Selection

In [104]:
df

    V   W   X   Y   Z
A  73  81  63   1  28
B   1  78  26  44  31
C  72  96   6  66   9
D  25   5  72  97  57
E  29  25  32  30  68


In [105]:
# Read: "Give me the part of the DataFrame where the DataFrame is greater than 50."
df[df > 50]

      V     W     X     Y     Z
A  73.0  81.0  63.0   NaN   NaN
B   NaN  78.0   NaN   NaN   NaN
C  72.0  96.0   NaN  66.0   NaN
D   NaN   NaN  72.0  97.0  57.0
E   NaN   NaN   NaN   NaN  68.0


In [106]:
# Read: "Give me the rows where values in column W are greater than 50. "
df[df['W'] > 50]

    V   W   X   Y   Z
A  73  81  63   1  28
B   1  78  26  44  31
C  72  96   6  66   9


In [107]:
# Think of this as creating a new DataFrame that keeps only the rows where column W is greater than 50
# and THEN grabbing column W from that DataFrame.
df[df['W'] > 50]['W']

A    81
B    78
C    96
Name: W, dtype: int64

In [108]:
df[df['W'] > 50].loc[['A','B'],['V','Z']]

    V   Z
A  73  28
B   1  31
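Conditions can be combined, and .loc can filter rows and pick columns in one step. The sketch below uses a made-up DataFrame that reuses three of the columns above.

```python
import pandas as pd

df = pd.DataFrame(
    {'V': [73, 1, 72, 25, 29], 'W': [81, 78, 96, 5, 25], 'X': [63, 26, 6, 72, 32]},
    index=['A', 'B', 'C', 'D', 'E'],
)

# Combine conditions with & or |, each wrapped in parentheses.
# Python's `and` / `or` raise an error when used on Series.
both = df[(df['W'] > 50) & (df['V'] > 50)]
print(both.index.tolist())  # ['A', 'C']

# .loc takes a row condition and a list of columns together.
subset = df.loc[df['W'] > 50, ['V', 'X']]
print(subset)
```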


### Resetting the Index

In [109]:
df

    V   W   X   Y   Z
A  73  81  63   1  28
B   1  78  26  44  31
C  72  96   6  66   9
D  25   5  72  97  57
E  29  25  32  30  68


In [110]:
df.reset_index()

   index   V   W   X   Y   Z
0      A  73  81  63   1  28
1      B   1  78  26  44  31
2      C  72  96   6  66   9
3      D  25   5  72  97  57
4      E  29  25  32  30  68


In [111]:
new_index = "RED BLUE ORANGE GREEN YELLOW".split()
df["new index"] = new_index
df

    V   W   X   Y   Z  new index
A  73  81  63   1  28        RED
B   1  78  26  44  31       BLUE
C  72  96   6  66   9     ORANGE
D  25   5  72  97  57      GREEN
E  29  25  32  30  68     YELLOW


In [112]:
# This is not permanent unless you pass inplace=True (or assign the result back to df).
df.set_index('new index')

            V   W   X   Y   Z
new index
RED        73  81  63   1  28
BLUE        1  78  26  44  31
ORANGE     72  96   6  66   9
GREEN      25   5  72  97  57
YELLOW     29  25  32  30  68


In [113]:
df

    V   W   X   Y   Z  new index
A  73  81  63   1  28        RED
B   1  78  26  44  31       BLUE
C  72  96   6  66   9     ORANGE
D  25   5  72  97  57      GREEN
E  29  25  32  30  68     YELLOW
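A common alternative to inplace is to assign the returned DataFrame to a name. A small sketch with a made-up DataFrame; the column name colour is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'V': [73, 1], 'colour': ['RED', 'BLUE']}, index=['A', 'B'])

# set_index returns a new DataFrame; assign it to keep the change.
indexed = df.set_index('colour')
print(indexed.loc['BLUE', 'V'])  # 1

# reset_index moves the index back into a regular column.
restored = indexed.reset_index()
print(restored.columns.tolist())  # ['colour', 'V']
```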


### Dealing with Missing Data
You will quite often import datasets with missing data. It is vital to know how to deal with this.

In [114]:
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})
df

     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3


In [115]:
# This removes all rows with any NaN (missing) data. Pass inplace=True (or assign the result back)
# if you want to keep the change.
df.dropna()

     A    B  C
0  1.0  5.0  1


In [116]:
# If you want to remove all columns with missing data, use the axis parameter.
df.dropna(axis=1)

   C
0  1
1  2
2  3


In [117]:
# You can also fill any missing data with any value you want.
df.fillna("FILL")

      A     B  C
0     1     5  1
1     2  FILL  2
2  FILL  FILL  3
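In practice, missing values are often counted first and then filled with something derived from the data. The sketch below fills each column with its own mean; whether the mean is a sensible fill value depends on your dataset, so treat that choice as an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Count the missing values in each column.
print(df.isna().sum())  # A: 1, B: 2, C: 0

# Passing a Series to fillna fills each column with the matching value,
# here the mean of that column.
filled = df.fillna(df.mean())
print(filled)
```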


### GroupBy
The groupby method allows you to group rows of data together and call aggregate functions on them.

In [118]:
data = {"fav fruit": ["apple", "banana", "apple", "orange", "banana", "apple"], 
        "person": ["einstein", "tesla", "darwin", "cannon", "newton", "curie"], 
        "eaten": [3, 4, 7, 1, 6, 9]}
fruits = pd.DataFrame(data)
fruits

   fav fruit    person  eaten
0      apple  einstein      3
1     banana     tesla      4
2      apple    darwin      7
3     orange    cannon      1
4     banana    newton      6
5      apple     curie      9


In [119]:
# This creates a new groupby object. You can call aggregate functions on this.
by_fav_fruits = fruits.groupby("fav fruit")
by_fav_fruits

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f952aa3cad0>

In [120]:
by_fav_fruits.max()

             person  eaten
fav fruit
apple      einstein      9
banana        tesla      6
orange       cannon      1


In [121]:
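# numeric_only=True skips the text column "person"; newer Pandas versions raise an error without it.
by_fav_fruits.mean(numeric_only=True)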

              eaten
fav fruit
apple      6.333333
banana     5.000000
orange     1.000000


In [122]:
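# As with mean, numeric_only=True restricts the calculation to the numeric column.
by_fav_fruits.std(numeric_only=True)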

              eaten
fav fruit
apple      3.055050
banana     1.414214
orange          NaN


In [123]:
by_fav_fruits.count()

           person  eaten
fav fruit
apple           3      3
banana          2      2
orange          1      1


In [124]:
by_fav_fruits.describe()

           eaten
           count      mean       std  min  25%  50%  75%  max
fav fruit
apple        3.0  6.333333  3.055050  3.0  5.0  7.0  8.0  9.0
banana       2.0  5.000000  1.414214  4.0  4.5  5.0  5.5  6.0
orange       1.0  1.000000       NaN  1.0  1.0  1.0  1.0  1.0
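When you only need a few statistics, one option is to select the column first and pass a list of function names to agg. A minimal sketch using the same fruit data:

```python
import pandas as pd

fruits = pd.DataFrame({
    "fav fruit": ["apple", "banana", "apple", "orange", "banana", "apple"],
    "person": ["einstein", "tesla", "darwin", "cannon", "newton", "curie"],
    "eaten": [3, 4, 7, 1, 6, 9],
})

# Select the numeric column, then request several aggregates at once.
summary = fruits.groupby("fav fruit")["eaten"].agg(["sum", "mean", "max"])
print(summary)
```

Selecting the column before aggregating also sidesteps any trouble with non-numeric columns such as person.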


### Data Input and Output
Pandas can read from and write to many file formats, but you will most often end up using CSV. It is worth looking up the other supported formats so that you are aware of them.

In [125]:
# This is how you input and output data as CSV.

# Input:
# df = pd.read_csv("INPUT FILE PATH HERE")

# Output:
# df.to_csv("FILE NAME HERE")
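To try the round trip without touching any files, the sketch below writes CSV text to a string and reads it back through an in-memory buffer. The DataFrame is made up for this illustration; with a real dataset you would pass file paths instead.

```python
import io
import pandas as pd

df = pd.DataFrame({"person": ["einstein", "tesla"], "eaten": [3, 4]})

# With no path, to_csv returns the CSV text. index=False leaves out the row labels.
csv_text = df.to_csv(index=False)
print(csv_text)

# read_csv accepts a file path or any file-like object.
loaded = pd.read_csv(io.StringIO(csv_text))
print(loaded)
```

Without index=False, the row labels are written as an extra first column, which shows up as an "Unnamed: 0" column when the file is read back.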

### Some things to do right after importing a dataset
The datasets you use may be huge! It's always a good idea to get an understanding of what the dataset looks like by running these methods.

In [130]:
# Reveals the first few rows of a DataFrame.
fruits.head()

   fav fruit    person  eaten
0      apple  einstein      3
1     banana     tesla      4
2      apple    darwin      7
3     orange    cannon      1
4     banana    newton      6


In [131]:
# Gives a general idea of all columns and their data types.
fruits.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   fav fruit  6 non-null      object
 1   person     6 non-null      object
 2   eaten      6 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 272.0+ bytes
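A few other quick checks are often useful at this stage. This sketch rebuilds the fruit data so it runs on its own.

```python
import pandas as pd

fruits = pd.DataFrame({
    "fav fruit": ["apple", "banana", "apple", "orange", "banana", "apple"],
    "person": ["einstein", "tesla", "darwin", "cannon", "newton", "curie"],
    "eaten": [3, 4, 7, 1, 6, 9],
})

# (number of rows, number of columns)
print(fruits.shape)  # (6, 3)

# Summary statistics for the numeric columns.
print(fruits.describe())

# How often each value occurs in a column.
fruit_counts = fruits["fav fruit"].value_counts()
print(fruit_counts)
```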


## That's it folks!
There's a lot going on in Pandas, and this document is in no way a comprehensive list of all the methods in this library. Definitely take your time to explore the Pandas documentation online. 

You'll often run into issues/confusions when you first use Pandas! Don't fret: StackOverflow and Google are your friends.

Moreover, if you ever need help with anything related to data analytics, just talk to one of the workshop instructors. We're always around to help!