# Pandas Crash Course / Workbook

- what is a pd.Series
  - [answer](https://pandas.pydata.org/docs/user_guide/dsintro.html#series)
  - a: a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.
- what is a pd.DataFrame
  - [answer](https://pandas.pydata.org/docs/user_guide/10min.html#basic-data-structures-in-pandas)
  - a: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.
- how is pandas different/'better' than a 2D numpy matrix
  - a: numpy is implemented largely in C and feels similar to matlab. It's very fast, partly because of this and partly because every element of an array shares a single dtype (strings are allowed, but mixing types forces everything into one common type).
    pandas is built on top of numpy, but each column can have its own dtype and the rows and columns carry labels.
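
A minimal sketch of the difference described above. The variable names (`mixed`, `pets`, `heights`) are made up for illustration and are not part of the workbook.

```python
import numpy as np
import pandas as pd

# A NumPy array has one dtype; mixing numbers and strings
# coerces every element to a string.
mixed = np.array([1, 2.5, "cat"])
print(mixed.dtype.kind)  # U (unicode string)

# A DataFrame keeps a separate dtype per column and labels both axes.
pets = pd.DataFrame({"name": ["alan", "bob"], "height": [1.0, 2.0]})
print(pets.dtypes)

# A Series is a single labeled column; its labels are the index.
heights = pd.Series([1.0, 2.0], index=["alan", "bob"], name="height")
print(heights["bob"])  # 2.0
```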




### Numpy Array Basics
ISLP pdf pg. 52 (notes in markdown)
<!-- 
- numpy
  - 2.3.3
  - 2.3.5
  - 2.3.6 - up to #60 (feel free to read the rest, but you DON'T NEED TO KNOW IT)
- pandas
  - 2.3.7
  - 2.3.8 optional
- assessing model accuracy
  - 3.1.3
 -->

In [43]:
import numpy as np

# create an array of numbers from 1 to 10

# manual way:
np.array([1,2,3,4,5,6,7,8,9,10])


x = list(range(1,11))
x
np.array(x)

# or, use np.arange:
np.arange(1,11)


# using a list 
x = [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]
np.array(x)


# create an array of 10 random integers between 0 and 10 (the upper bound 11 is exclusive)
rand_array = np.random.randint(11, size=10)

# show the length of the array

len(rand_array)

# or

rand_array.shape


# create a 2D array (2,3) of random numbers
np.random.rand(2,3)


# HINT: np.random.rand?

array([[0.62953296, 0.24110372, 0.56781533],
       [0.91165783, 0.90451459, 0.64721689]])
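
The random values above change on every run. As a hedged aside, NumPy's newer Generator interface can produce the same kinds of arrays reproducibly; the seed value `0` and the variable names here are arbitrary choices for illustration.

```python
import numpy as np

# Seeded generator: the same seed gives the same numbers on every run.
rng = np.random.default_rng(seed=0)

# 10 random integers from 0 to 10 (the upper bound 11 is exclusive)
rand_ints = rng.integers(0, 11, size=10)

# 2D array of shape (2, 3) with floats in [0, 1)
rand_2d = rng.random((2, 3))

print(rand_ints.shape)  # (10,)
print(rand_2d.shape)    # (2, 3)
```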

In [48]:
# create an array of numbers ordered 0 to 9; named x
x = np.arange(0,10)
x


# slice the 0th value of the array
x[0]

# slice the last value of the array
x[-1]


# slice the last 3 values from the 1D array i.e. [7,8,9]
x[-3:]

# slice the values at index 3 through 6 (inclusive) i.e. [3,4,5,6]; the stop index 7 is excluded

x[3:7]

array([3, 4, 5, 6])

## Pandas Basics

In [49]:
import pandas as pd
from IPython.display import display

### Create a DataFrame

In [53]:
# create a dataframe with 4 columns
# - "pet"    which has cat, dog, cat, cat, bird
# - "height" which has 1, 2, 1.2, .9, .3
# - "length" which has 1.6, 2.2, 1.5, 1, .1
# - "name"   which has alan, bob, chris, dave, ed
data = {
    "pet": ("cat", "dog", "cat", "cat", "bird"),
    "height": (1, 2, 1.2, .9, .3),
    "length": (1.6, 2.2, 1.5, 1, .1),
    "name":("alan", "bob", "chris", "dave", "ed")
}
df = pd.DataFrame(data)

# a column centric view of the data
df

|   | pet | height | length | name |
|---|-----|--------|--------|------|
| 0 | cat | 1.0 | 1.6 | alan |
| 1 | dog | 2.0 | 2.2 | bob |
| 2 | cat | 1.2 | 1.5 | chris |
| 3 | cat | 0.9 | 1.0 | dave |
| 4 | bird | 0.3 | 0.1 | ed |


In [56]:
# the same problem as above but with a row-centric view of the data
# this is more 'natural' if the data is organized by row (i.e. pet-data)
#df.T
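
One possible way to build the same frame row by row, sketched under the assumption that each pet is one record. The names `rows`, `records`, `df_rows` and `df_records` are illustrative only.

```python
import pandas as pd

# Row-centric: one dict per pet, keys are the column names.
rows = [
    {"pet": "cat", "height": 1.0, "length": 1.6, "name": "alan"},
    {"pet": "dog", "height": 2.0, "length": 2.2, "name": "bob"},
    {"pet": "cat", "height": 1.2, "length": 1.5, "name": "chris"},
    {"pet": "cat", "height": 0.9, "length": 1.0, "name": "dave"},
    {"pet": "bird", "height": 0.3, "length": 0.1, "name": "ed"},
]
df_rows = pd.DataFrame(rows)

# Same idea with plain tuples plus an explicit list of column names.
records = [tuple(r.values()) for r in rows]
df_records = pd.DataFrame(records, columns=["pet", "height", "length", "name"])

print(df_rows.shape)  # (5, 4)
```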


### Display Counts - i.e. Frequency Table

In [67]:
# display a frequency table of the pet categories HINT 'count' of 'values'
df['pet'].value_counts()



# display the same table but instead of absolute counts, show the % of the data for each category
# HINT look at the function args
df['pet'].value_counts(normalize=True)

cat     0.6
dog     0.2
bird    0.2
Name: pet, dtype: float64

### Challenge 1 - Find the tallest cat

In [105]:
# select the tallest cat from the dataframe - what is the name?
x = df.copy(deep=True)

# ans: filter to only cats, sort by height, take the top row (general solution)

# create a boolean array that is True where pet == cat
is_cat = x['pet'] == "cat"

# filter by passing the boolean array: df[filter_array]
x = x[is_cat]

# the same filter in one line instead
z = x[x["pet"] == "cat"]

# now that we've filtered, sort the WHOLE filtered df by height, descending
z = z.sort_values('height', ascending=False)

# get the first row i.e. .iloc[0], then get the name i.e. ['name']
z.iloc[0]['name']

# # creating a new column: whatever string you give is the name of the column,
# # then fill it with the array you just made
# x["is_cat_this_a_cat"] = is_cat
# x


'chris'

In [7]:
# everything above as a one-liner (starting from the original df)
df[df["pet"] == "cat"].sort_values('height', ascending=False).iloc[0]['name']
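
An alternative sketch that skips the sort: idxmax returns the index label of the largest value, which can then be looked up with .loc. The frame is rebuilt here so the block runs on its own; `cats` and `tallest_label` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "pet": ("cat", "dog", "cat", "cat", "bird"),
    "height": (1, 2, 1.2, .9, .3),
    "length": (1.6, 2.2, 1.5, 1, .1),
    "name": ("alan", "bob", "chris", "dave", "ed"),
})

# filter to cats, then find the index label of the max height
cats = df[df["pet"] == "cat"]
tallest_label = cats["height"].idxmax()

# look up the name at that label
print(df.loc[tallest_label, "name"])  # chris
```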


### Challenge 2 - Find the shortest name

In [138]:
# select the shortest name from the df - what name is it?

# ans: get an array of name string lengths

# reset x
x = df.copy(deep = True)

# apply runs a function element-wise
# lambda is a one-line anonymous function with syntax: lambda arg: expression
# e.g. lambda z: z.upper()
# is equivalent to: def f(z): return z.upper()
name_len = x['name'].apply(lambda ele: len(ele))

# add name_len to df x
x["name_len"] = name_len

# sort, then pick out the first row, column name = 'name'
x.sort_values('name_len').iloc[0]["name"]

# get a bool array of the names that equal min str length
z = df.copy(deep=True)
zname_len = z['name'].apply(lambda ele: len(ele))
is_shortest = zname_len == zname_len.min()

# mask the array and get the names
z = z[is_shortest]
z['name']
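
The masking approach can also be sketched with the vectorized string accessor instead of apply. This is one option, not the only one; `names`, `lengths` and `shortest` are illustrative names.

```python
import pandas as pd

names = pd.Series(["alan", "bob", "chris", "dave", "ed"], name="name")

# .str.len() gives the length of every string at once
lengths = names.str.len()

# boolean mask: True where the length equals the minimum length
shortest = names[lengths == lengths.min()]

print(shortest.tolist())  # ['ed']
```

Unlike sorting and taking the first row, the mask keeps every name that ties for the shortest length.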


### Column Manipulations

In [142]:
# order the dataframe according to height ascending

# reset x
x = df.copy(deep = True)

x = x.sort_values('height')
x

|   | pet | height | length | name |
|---|-----|--------|--------|------|
| 4 | bird | 0.3 | 0.1 | ed |
| 3 | cat | 0.9 | 1.0 | dave |
| 0 | cat | 1.0 | 1.6 | alan |
| 2 | cat | 1.2 | 1.5 | chris |
| 1 | dog | 2.0 | 2.2 | bob |


In [145]:
# create a column named area which is the length times height
x["area"] = x['height'] * x['length']
x

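|   | pet | height | length | name | area |
|---|-----|--------|--------|------|------|
| 4 | bird | 0.3 | 0.1 | ed | 0.03 |
| 3 | cat | 0.9 | 1.0 | dave | 0.9 |
| 0 | cat | 1.0 | 1.6 | alan | 1.6 |
| 2 | cat | 1.2 | 1.5 | chris | 1.8 |
| 1 | dog | 2.0 | 2.2 | bob | 4.4 |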


In [148]:
# drop the column area
x = x.drop(columns=['area'])
x

|   | pet | height | length | name |
|---|-----|--------|--------|------|
| 4 | bird | 0.3 | 0.1 | ed |
| 3 | cat | 0.9 | 1.0 | dave |
| 0 | cat | 1.0 | 1.6 | alan |
| 2 | cat | 1.2 | 1.5 | chris |
| 1 | dog | 2.0 | 2.2 | bob |


In [150]:
# select only the pet and name columns

x = x[["pet","name"]]
x

|   | pet | name |
|---|-----|------|
| 4 | bird | ed |
| 3 | cat | dave |
| 0 | cat | alan |
| 2 | cat | chris |
| 1 | dog | bob |


In [152]:
# reorder the columns to be in the order of ['name', 'pet', 'length', 'height']

# reset x
x = df.copy(deep = True)
x[['name', 'pet', 'length', 'height']]

|   | name | pet | length | height |
|---|------|-----|--------|--------|
| 0 | alan | cat | 1.6 | 1.0 |
| 1 | bob | dog | 2.2 | 2.0 |
| 2 | chris | cat | 1.5 | 1.2 |
| 3 | dave | cat | 1.0 | 0.9 |
| 4 | ed | bird | 0.1 | 0.3 |


In [161]:
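# rename the columns to ["a", "b", "c", "d"]
x.columns = ["a", "b", "c", "d"]
# the row labels (index) can be renamed the same way; there are 5 rows, so 5 labels
x.index = ["a", "b", "c", "d", 'e']
x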

|   | a | b | c | d |
|---|---|---|---|---|
| a | cat | 1.0 | 1.6 | alan |
| b | dog | 2.0 | 2.2 | bob |
| c | cat | 1.2 | 1.5 | chris |
| d | cat | 0.9 | 1.0 | dave |
| e | bird | 0.3 | 0.1 | ed |


In [167]:
# rename the column "pet" to "animal_pal"
# reset x
x = df.copy(deep = True)
x = x.rename(columns={"pet":"animal_pal"})
x

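|   | animal_pal | height | length | name |
|---|------------|--------|--------|------|
| 0 | cat | 1.0 | 1.6 | alan |
| 1 | dog | 2.0 | 2.2 | bob |
| 2 | cat | 1.2 | 1.5 | chris |
| 3 | cat | 0.9 | 1.0 | dave |
| 4 | bird | 0.3 | 0.1 | ed |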


### Subset DataFrames & Index Adjustments
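The key concept here is that most DataFrame operations (filtering, set_index, reset_index, sort_values, ...) return a new object and leave the original untouched. You have to be explicit: assign the result back to keep the change, otherwise the original dataframe stays as it was.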
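
A small sketch of that behavior, using a made-up three-row frame so the result is easy to check by eye:

```python
import pandas as pd

df = pd.DataFrame({"name": ["alan", "bob", "ed"], "height": [1.0, 2.0, 0.3]})

# The sorted result is returned but not stored, so df is unchanged.
df.sort_values("height")
print(df["name"].tolist())         # ['alan', 'bob', 'ed']

# Assign the result to keep it.
sorted_df = df.sort_values("height")
print(sorted_df["name"].tolist())  # ['ed', 'alan', 'bob']
```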

In [180]:
# select the subset of columns pet, name - as a new dataframe
# reset x
x = df.copy(deep = True)
pet_names = x[["pet", "name"]].copy()
pet_names

# select only cats (this returns a new dataframe; x itself is unchanged)
x[x["pet"] == 'cat']
x

# set the index to name - permanently
x = x.set_index('name')
x


# reset the index - permanently
x = x.reset_index()



# delete new


### DataFrame Filtering

In [184]:
# filter the dataframe to only dogs
x[x['pet'] == 'dog']

# filter the dataframe to only height >= 1
x[x['height'] >= 1]

# filter the dataframe to only names ed and alan
x[x['name'].isin(["ed", "alan"])]


|   | name | pet | height | length |
|---|------|-----|--------|--------|
| 0 | alan | cat | 1.0 | 1.6 |
| 4 | ed | bird | 0.3 | 0.1 |
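
The filters above each use a single condition. As a sketch, conditions can be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses. The frame is rebuilt here so the block runs on its own, and the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "pet": ("cat", "dog", "cat", "cat", "bird"),
    "height": (1, 2, 1.2, .9, .3),
    "length": (1.6, 2.2, 1.5, 1, .1),
    "name": ("alan", "bob", "chris", "dave", "ed"),
})

# cats AND height >= 1
tall_cats = df[(df["pet"] == "cat") & (df["height"] >= 1)]
print(tall_cats["name"].tolist())  # ['alan', 'chris']

# not a cat OR height < 1
other = df[(df["pet"] != "cat") | (df["height"] < 1)]
print(other["name"].tolist())      # ['bob', 'dave', 'ed']

# the same AND filter written as a query string
tall_cats_q = df.query("pet == 'cat' and height >= 1")
```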


### String Operations

In [191]:
# rename all the columns to be uppercase
x.columns = [col.upper() for col in x.columns]
x

# set all the columns back to lowercase
x.columns = [col.lower() for col in x.columns]
x


# capitalize all the names HINT check string methods
# set the capitalized values as the values in the df!
x["cap_name"] = x["name"].str.capitalize()
x

x['lam_name'] = x['name'].apply(lambda ele: ele.capitalize())
x

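|   | name | pet | height | length | cap_name | lam_name |
|---|------|-----|--------|--------|----------|----------|
| 0 | alan | cat | 1.0 | 1.6 | Alan | Alan |
| 1 | bob | dog | 2.0 | 2.2 | Bob | Bob |
| 2 | chris | cat | 1.2 | 1.5 | Chris | Chris |
| 3 | dave | cat | 0.9 | 1.0 | Dave | Dave |
| 4 | ed | bird | 0.3 | 0.1 | Ed | Ed |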


### Challenge 3 - Add a Row of Data
This problem is a one-row version of "stacking" dataframes (i.e. UNION in SQL)

In [225]:
# create a copy of the df
x = df.copy(deep=True)
x
# add a row with ["dog", 2, 3, "frank"]
# this creates a df with 1:1 parity with our initial dataset
row = ["dog", 2, 3, "frank"]
arrays = [[v] for v in row]
d = dict(zip(x.columns, arrays))
new_row = pd.DataFrame(d)         

# use concat to stack these two dfs!
pd.concat([x,new_row])

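|   | pet | height | length | name |
|---|-----|--------|--------|------|
| 0 | cat | 1.0 | 1.6 | alan |
| 1 | dog | 2.0 | 2.2 | bob |
| 2 | cat | 1.2 | 1.5 | chris |
| 3 | cat | 0.9 | 1.0 | dave |
| 4 | bird | 0.3 | 0.1 | ed |
| 0 | dog | 2.0 | 3.0 | frank |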
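
Note that the stacked result has the index label 0 twice, because concat keeps each frame's own index. A sketch of one way to get a fresh 0..n-1 index, using concat's ignore_index argument (the frame is rebuilt here so the block runs on its own):

```python
import pandas as pd

df = pd.DataFrame({
    "pet": ("cat", "dog", "cat", "cat", "bird"),
    "height": (1, 2, 1.2, .9, .3),
    "length": (1.6, 2.2, 1.5, 1, .1),
    "name": ("alan", "bob", "chris", "dave", "ed"),
})
new_row = pd.DataFrame(
    {"pet": ["dog"], "height": [2.0], "length": [3.0], "name": ["frank"]}
)

# ignore_index=True discards the old labels and renumbers the rows
stacked = pd.concat([df, new_row], ignore_index=True)
print(stacked.index.tolist())  # [0, 1, 2, 3, 4, 5]
```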


In [252]:
# create a copy of the df
x = df.copy(deep=True)
x

z = x.T
z[5] = ["dog", 2, 3, "frank"]
x = z.T
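
One caveat with the transpose approach, sketched below: transposing a frame with mixed column types stores everything as generic Python objects, so the numeric columns lose their float dtype and need to be cast back. The names `t`, `via_transpose` and `restored` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "pet": ("cat", "dog", "cat", "cat", "bird"),
    "height": (1, 2, 1.2, .9, .3),
    "length": (1.6, 2.2, 1.5, 1, .1),
    "name": ("alan", "bob", "chris", "dave", "ed"),
})

t = df.T                               # rows become columns, all object dtype
t[5] = ["dog", 2, 3, "frank"]          # the new row is added as a new column
via_transpose = t.T
print(via_transpose["height"].dtype)   # object

# cast the numeric columns back to floats
restored = via_transpose.astype({"height": "float64", "length": "float64"})
print(restored["height"].dtype)        # float64
```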