# Pandas and NumPy Introduction

In [None]:
!python -V

In [None]:
%pip install numpy pandas matplotlib

In [2]:
# Let's import NumPy

import numpy as np

## Data Analysis

Python is one of the most popular languages for data analysis tasks and ML/DL.

To train ML models, the data needs to be properly preprocessed. 

Pandas and NumPy are extremely popular libraries for this purpose:
they provide data structures and tools for analyzing data.

<br>

Data analysis is needed across a variety of industries and purposes, including:
* Statistics, stock prediction, neuroscience, advertising, natural language processing, Netflix recommendations, and many more.

<br>

We are currently in the era of big data.

Data is a global asset that can be used in endless ways. 

Accessible tools like Pandas and NumPy are invaluable to the future of innovation.


## NumPy


NumPy is a library for fast operations on multi-dimensional arrays in Python.
* NumPy arrays are used over regular lists because they are faster and use memory more efficiently.
* NumPy also supports element-wise operations, which regular lists do not.

NumPy is primarily used for scientific analysis of numeric data. This is great for the calculation stages of data science.
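
To make the element-wise point concrete, here is a minimal sketch with made-up values (the variable names are illustrative only). It compares what `*` and `+` do on a plain list and on a NumPy array built from it.

```python
import numpy as np

values = [1, 2, 3]
arr_values = np.array(values)

# Multiplying a list repeats it; multiplying an array scales each element
repeated = values * 2
scaled = arr_values * 2

# Adding lists concatenates them; adding arrays sums matching elements
joined = values + values
summed = arr_values + arr_values

print(repeated)  # [1, 2, 3, 1, 2, 3]
print(scaled)    # [2 4 6]
print(joined)    # [1, 2, 3, 1, 2, 3]
print(summed)    # [2 4 6]
```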


In [None]:
# A NumPy array has a fixed size, unlike a Python list

ListA = [0, 2, 3, 6, 1, 8, 7]
arr = np.array(ListA)
arr

In [None]:
print("Dimension", arr.ndim)
print("Shape", arr.shape)
print("Element Type", arr.dtype)

In [None]:
# create a 2d numpy array from multiple lists

ListA = [0, 1, 2, 3]
ListB = [4, 5, 6, 7]
ListC = [8, 9, 10, 11]

array_2d = np.array([ListA, ListB, ListC])
array_2d

In [None]:
print("Dimension", array_2d.ndim)
print("Shape", array_2d.shape)
print("Element Type", array_2d.dtype)

### NumPy basic indexing and selecting
---

In [None]:
# NumPy indexing goes [row, col], the same order as nested Python lists and pandas .iloc

ListA = [0, 2, 3, 6, 1, 8, 7]
arr = np.array(ListA)

# grab elements before index 2
arr[:2]

In [None]:
# grab elements from index 4 up to (not including) index 8; the slice stops at the end of the array
arr[4:8]

In [None]:
# NumPy arrays are indexed by integer position only, so there is no label-based selection
# You do not need iloc for row/col selection

ListA = [0, 1, 2, 3]
ListB = [4, 5, 6, 7]
ListC = [8, 9, 10, 11]

array_2d = np.array([ListA, ListB, ListC])

# Grab all rows from the first two columns
array_2d[:, :2]

In [None]:
# Grab all rows from columns index 0, 2, and 3
array_2d[:, [0, 2, 3]]

In [None]:
# You can assign a specific value to a specific array location

# assign row 0, column 2 to value 5
array_2d[0, 2] = 5
array_2d

In [None]:
# assign all of row 0 to 5
array_2d[0, :] = 5
array_2d

### NumPy functions
---

In [None]:
# you can create an array full of zeros with np.zeros

arr = np.zeros((2, 5))
print(arr)

In [None]:
# You can create an array full of ones with np.ones

arr = np.ones((2, 5))
print(arr)

In [None]:
# You can fill a numpy array with a specific value using np.full

arr = np.full((2, 5), 100)
print(arr)

In [None]:
# You can fill a numpy array with random integers using np.random.randint

# Create an array of size (2, 5) with integers from 0 to 9 (the upper bound 10 is excluded)
arr = np.random.randint(10, size=(2, 5))
print(arr)

In [None]:
# Create an array of size (2, 5) with random numbers between 0 and 1
arr = np.random.rand(2, 5)
print(arr)

In [None]:
# use np.copy() to make an independent copy of a NumPy array

arr2 = np.copy(arr)
arr2
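
Why copy at all? The sketch below uses made-up values and illustrative names to show that plain assignment and basic slicing both share data with the original array, while `np.copy` does not.

```python
import numpy as np

original = np.array([1, 2, 3, 4])

alias = original                 # same array object, nothing is copied
view = original[:2]              # basic slicing returns a view on the same data
independent = np.copy(original)  # separate data

original[0] = 99

print(alias[0])        # 99
print(view[0])         # 99
print(independent[0])  # 1
```

Changing `original` shows up through `alias` and `view`, but the copy keeps its old value.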

### Mini Challenge! :)

Using what you've learned, create a NumPy array of the grid shown

<img src="extra/imgs/grid.png" width=400px height=400px />


---

### Mini-Challenge 2
1. Create a NumPy array with the shape (3, 3, 3) filled with random integers. Print it out.
2. Create another one. Add them and print the results.
3. Now try multiplying them.
4. Get a slice of the resulting array such that the shape of the slice is (3, 3). How do you think the multiplication worked?

In [4]:
arr = np.random.randint(2, size=(3,3,3))
arr2 = np.random.randint(2, size=(3,3,3))

print(arr)
print()
print(arr2)


[[[0 1 0]
  [1 0 0]
  [0 0 0]]

 [[0 1 0]
  [0 1 0]
  [1 1 1]]

 [[1 0 0]
  [0 0 1]
  [1 0 0]]]

[[[1 1 1]
  [1 0 1]
  [0 1 0]]

 [[1 0 0]
  [0 1 0]
  [0 1 1]]

 [[0 0 1]
  [0 0 0]
  [0 0 1]]]


In [5]:
print(arr + arr2)

[[[1 2 1]
  [2 0 1]
  [0 1 0]]

 [[1 1 0]
  [0 2 0]
  [1 2 2]]

 [[1 0 1]
  [0 0 1]
  [1 0 1]]]


In [6]:

arr3 = arr * arr2
arr3[0, :, :]

array([[0, 1, 0],
       [1, 0, 0],
       [0, 0, 0]], dtype=int32)
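
The output above is consistent with `*` multiplying entries position by position. As a small illustration with made-up 2x2 arrays, the sketch below contrasts that with the `@` operator, which performs matrix multiplication instead.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[10, 20],
              [30, 40]])

elementwise = a * b     # multiplies entries at the same position
matrix_product = a @ b  # rows of a combined with columns of b

print(elementwise)
# [[ 10  40]
#  [ 90 160]]
print(matrix_product)
# [[ 70 100]
#  [150 220]]
```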

## Pandas

Pandas is a Python library that provides data analysis tools and data structures for manipulating numeric and time series data.

Pandas is great for the preparation / wrangling stages of data science. 

This involves importing, organizing, labeling, and manipulating the structure and contents. 



---

In [None]:
# To start working with pandas, you typically import your data from a CSV file into a DataFrame
import pandas as pd

df = pd.read_csv('pokemon.csv')

In [None]:
# Converting a dataframe into a NumPy array

arr = df.to_numpy()
print(f'{arr} \n {type(arr)}')

### Pandas Data Structures
---

The two main data structures used in pandas are Series and DataFrames.

In [None]:
# A Series is a 1-dimensional labeled array that can store any data type

# Numeric Data
Data = [1, 3, 4, 5, 6, 2, 9]

# Predefined index values
Index = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

# Creating a series with default index values
s = pd.Series(Data)

# Creating a series with predefined index values
si = pd.Series(Data, index=Index)

In [None]:
print(s)

In [None]:
print(si)
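
Once a Series has custom index labels, values can be read either by label or by position. The following is a minimal sketch with a shortened, made-up Series (the name `scores` is illustrative).

```python
import pandas as pd

scores = pd.Series([1, 3, 4], index=['a', 'b', 'c'])

by_label = scores.loc['b']    # look up by index label
by_position = scores.iloc[1]  # look up by integer position
doubled = scores * 2          # element-wise; the labels are kept

print(by_label, by_position)  # 3 3
print(list(doubled.index))    # ['a', 'b', 'c']
print(doubled.tolist())       # [2, 6, 8]
```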

In [None]:
# A DataFrame is a 2D representation of a set of data

data = {
    'Name': ['Sara', 'Andrew', 'Jack', 'Vanamala'],
    'Age': [21, 19, 23, 28]
}

# Create DataFrame
df = pd.DataFrame(data)

print(df)

### Mini Challenge ! :)

Create a DataFrame with your own predefined columns, index, and data

Use at least 3 different columns

---

### Indexing and Selecting Data
---

In [None]:
# Example cat owner dataset

cat_owners_data = {
    'John': ['Garfield', 30, 7, 5],
    'Jack': ['Nermal', 5, 1, 9], 
    'Andrew': ['Socks', 9, 3, 8],
    'Vanamala': ['Destroyer', 8, 6, 3]
}

cat_owners_index = ['Cat Name', 'Weight', 'Age', 'Lives']

cat_owners_df = pd.DataFrame(cat_owners_data, cat_owners_index)

cat_owners_df

In [None]:
# The main indexers for selecting data from DataFrames are .loc[] and .iloc[]
# .iloc[] selects elements by integer position

# grabs the row at position 1 (the second row)
cat_owners_df.iloc[1]

In [None]:
# grabs elements from multiple rows
cat_owners_df.iloc[[0, 2, 3]]

In [None]:
# grab the middle two rows, Weight and Age
cat_owners_df.iloc[1:3]

In [None]:
# grab elements from the last two rows, and the first two columns
cat_owners_df.iloc[[2, 3], [0, 1]]

In [None]:
# .loc[] grabs elements using labels
# like NumPy, pandas 2-D indexing goes [rows, cols]

# grab data only from the row 'Cat Name'
cat_owners_df.loc['Cat Name']

In [None]:
# grab data from multiple rows
cat_owners_df.loc[['Cat Name', 'Lives']]

In [None]:
# grabs all rows from column Vanamala

cat_owners_df.loc[:, ['Vanamala']]

In [None]:
# grab rows Age and Lives from columns Jack and Vanamala

cat_owners_df.loc[['Age', 'Lives'], ['Jack', 'Vanamala']]
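
One difference between the two indexers is easy to miss: label slices with `.loc` include the end label, while position slices with `.iloc` exclude the end position. Here is a minimal sketch on a small hypothetical table (`demo_df` and its labels are made up for illustration).

```python
import pandas as pd

# Hypothetical table: rows are labelled 'x', 'y', 'z'
demo_df = pd.DataFrame(
    {'A': [1, 2, 3], 'B': [4, 5, 6]},
    index=['x', 'y', 'z'],
)

by_label = demo_df.loc['y', 'B']  # row label 'y', column label 'B'
by_position = demo_df.iloc[1, 1]  # second row, second column

# Label slices include the end label; position slices exclude the end position
label_slice = demo_df.loc['x':'y']  # rows 'x' and 'y'
position_slice = demo_df.iloc[0:1]  # row 'x' only

print(by_label, by_position)                  # 5 5
print(len(label_slice), len(position_slice))  # 2 1
```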

### Mini Challenge ! :)

Write code using .iloc[] and .loc[] to grab the Cat Name and Age rows for owner John

---

### Viewing Data
---

In [None]:
# .head() displays the first rows of your data (5 by default)

cat_owners_df.head()

In [None]:
# using a number parameter specifies how many rows you want to display
cat_owners_df.head(2)

In [None]:
# .tail() displays the last rows of your data (5 by default)

cat_owners_df.tail()

In [None]:
cat_owners_df.tail(3)

In [None]:
# use .describe() to see statistical descriptions of your data quickly
cat_owners_df.describe()

In [None]:
# use .columns to display your column labels
cat_owners_df.columns

In [None]:
# use .index to display your index labels
cat_owners_df.index

### Mini Challenge ! :)

Grab the last two rows of the df using .tail()

---

### Operations
---

In [None]:
# Example ice cream dataset: one row per person, with Flavor and Age columns

icecream_order_data = {
        'Flavor': ['Chocolate', 'Mint Chip', 'Chocolate', 'Vanilla', 'Mint Chip', 'Chocolate', 'Superman'],
        'Age': [20, 13, 18, 10, 29, 23, 5]
}

icecream_order_index = ['Jonathan', 'Evan', 'Audrey', 'Olivia', 'Ethan', 'Jasmine', 'Zack']

icecream_order_df = pd.DataFrame(icecream_order_data, icecream_order_index)

icecream_order_df

In [None]:
# .mean() returns the mean value of your selected data
mean_age = icecream_order_df['Age'].mean()
print(mean_age)

In [None]:
# .sum() returns the sum of your selected data
# (the result is not named `sum`, which would shadow Python's built-in sum function)
total_age = icecream_order_df['Age'].sum()
print(total_age)

In [None]:
# .value_counts() counts the number of times each unique value appears
icecream_order_df['Flavor'].value_counts()

In [None]:
# .apply() will take a function and apply it to your data.
# This can be a NumPy function

icecream_order_df.loc[:, ['Age']].apply(np.sum)
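
`.apply()` also accepts functions you write yourself. As a hedged sketch, the helper `age_group` below is hypothetical and the ages are a shortened sample; on a Series, the function is called once per element.

```python
import pandas as pd

ages = pd.Series([20, 13, 18], index=['Jonathan', 'Evan', 'Audrey'])

def age_group(age):
    """Hypothetical helper: label an age as 'adult' or 'minor'."""
    return 'adult' if age >= 18 else 'minor'

groups = ages.apply(age_group)  # called once per element
print(groups.tolist())          # ['adult', 'minor', 'adult']
```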

### Mini Challenge ! :)

Use the .mean() function with your previously created DataFrame

---

### Manipulating Data

---

In [None]:
# For this section, we will be using the cat owners Dataset

cat_owners_df

In [None]:
# Manually add a new column to the DataFrame

cat_info = ['Mittens', 7, 5, 9]
cat_owners_df['Dr.Fine'] = cat_info

cat_owners_df

In [None]:
# To insert a list as a new column in a dataframe you can use the .insert() function.
# .insert(loc, column, value) 

cat_info = ['Salem', 5, 10, 1]

cat_owners_df.insert(5, 'Sabrina', cat_info)

cat_owners_df

In [None]:
# Insert a new column at position 3 (it becomes the fourth column)

cat_info = ['Pickles', 2, 6, 9]

cat_owners_df.insert(3, 'Sara', cat_info)

cat_owners_df

In [None]:
# To delete a row/column, use the .drop() function
# .drop(label, axis, inplace)
# axis specifies 1=columns, 0=rows
# inplace determines if the original dataframe is altered

cat_owners_df.drop('Sabrina', axis=1, inplace=True)

cat_owners_df

In [None]:
# Create a new dataFrame where the row "Lives" is dropped

cat_owners_df_2 = cat_owners_df.drop('Lives', axis=0, inplace=False)

cat_owners_df_2
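
The effect of leaving `inplace` at its default can be checked on a tiny made-up table. In this sketch (all names are illustrative), `.drop()` returns a new DataFrame and the original keeps every row and column.

```python
import pandas as pd

small_df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['r0', 'r1'])

without_b = small_df.drop('B', axis=1)  # returns a new DataFrame
without_r0 = small_df.drop(index='r0')  # keyword form for dropping rows

print(list(small_df.columns))   # ['A', 'B']  (original unchanged)
print(list(without_b.columns))  # ['A']
print(list(without_r0.index))   # ['r1']
```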

### Mini Challenge ! :)
Using your created DataFrame, use .insert() to add a new column, and .drop() to remove a row

---

### Merging Data

---

In [None]:
# Example employee data
data1 = {
    'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
    'Age': [27, 24, 22, 32],
    'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
    'Qualification': ['Msc', 'MA', 'MCA', 'Phd']
}
    

data2 = {
    'Name': ['Abhi', 'Ayushi', 'Dhiraj', 'Hitesh'],
    'Age': [17, 14, 12, 52],
    'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
    'Qualification': ['Btech', 'B.A', 'Bcom', 'B.hons']
}

df1 = pd.DataFrame(data1, index=[0, 1, 2, 3])

df2 = pd.DataFrame(data2, index=[4, 5, 6, 7])

df1

In [None]:
# combine DataFrames by stacking them with the pd.concat() function

frames = [df1, df2]

df3 = pd.concat(frames)

df3

In [None]:
# concat two DataFrames and label each source with keys (creates a hierarchical index)

df3 = pd.concat(frames, keys=['DF1', 'DF2'])

df3
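
With `keys`, each source DataFrame can be pulled back out by its key. The sketch below uses two shortened, made-up frames and also shows `ignore_index=True`, an alternative that discards the original row labels.

```python
import pandas as pd

first = pd.DataFrame({'Name': ['Jai', 'Princi']})
second = pd.DataFrame({'Name': ['Abhi', 'Ayushi']})

# keys adds an outer index level naming the source of each row
labelled = pd.concat([first, second], keys=['DF1', 'DF2'])
from_second = labelled.loc['DF2']

# ignore_index discards the original row labels and renumbers from 0
renumbered = pd.concat([first, second], ignore_index=True)

print(from_second['Name'].tolist())  # ['Abhi', 'Ayushi']
print(list(renumbered.index))        # [0, 1, 2, 3]
```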

In [None]:
# Example dataframes

data1 = {
    'key': ['K0', 'K1', 'K2', 'K3'],
    'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
    'Age': [27, 24, 22, 32]
}

data2 = {
    'key': ['K0', 'K1', 'K2', 'K3'],
    'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
    'Qualification': ['Btech', 'B.A', 'Bcom', 'B.hons']
}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

In [None]:
# use the pd.merge() function to merge two DataFrames based on a key

df3 = pd.merge(df1, df2, on='key')

df3
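
In the example above every key appears in both tables. When keys only partly overlap, the `how` parameter decides which rows survive. This is a minimal sketch with made-up frames (`left` and `right` are illustrative names).

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                     'Name': ['Jai', 'Princi', 'Gaurav']})
right = pd.DataFrame({'key': ['K1', 'K2', 'K3'],
                      'Address': ['Kanpur', 'Allahabad', 'Nagpur']})

inner = pd.merge(left, right, on='key')               # default how='inner'
outer = pd.merge(left, right, on='key', how='outer')  # keep unmatched keys too

print(inner['key'].tolist())          # ['K1', 'K2']
print(sorted(outer['key'].tolist()))  # ['K0', 'K1', 'K2', 'K3']
print(outer['Name'].isna().sum())     # 1
```

The outer merge fills the columns that have no match with missing values, which is why one `Name` is missing for key `K3`.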

### Intermediate pandas with Pokemon Dataset

---

In [None]:
# importing csv data

pokemon_df = pd.read_csv('pokemon.csv')

pokemon_df.head()

In [None]:
# you can sort data using .sort_values() function

# Sort the DataFrame by Pokemon Name in alphabetical order
pokemon_df.sort_values(by='Name')

In [None]:
# Sort the DataFrame by Pokemon Name in reverse alphabetical order
# using the ascending = False parameter
pokemon_df.sort_values(by='Name', ascending=False)

In [None]:
# Sort the DataFrame numerically using Pokemon HP
pokemon_df.sort_values(by='HP', ascending=False)

In [None]:
# You can also sort using multiple values
pokemon_df.sort_values(by=['Type 1', 'HP'], ascending=False)
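
`ascending` can also take one flag per sort column. The sketch below uses a small hand-typed sample rather than the CSV file (the name `demo_df` is illustrative) and sorts types A to Z while ordering HP from high to low inside each type.

```python
import pandas as pd

demo_df = pd.DataFrame({
    'Type 1': ['Grass', 'Fire', 'Grass', 'Fire'],
    'HP': [45, 39, 80, 78],
})

# Type 1 ascending (A to Z), then HP descending within each type
sorted_df = demo_df.sort_values(by=['Type 1', 'HP'], ascending=[True, False])

print(sorted_df['Type 1'].tolist())  # ['Fire', 'Fire', 'Grass', 'Grass']
print(sorted_df['HP'].tolist())      # [78, 39, 80, 45]
```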

In [None]:
# Create a column totaling Attack and Defense
pokemon_df['Total_Atk_Dfn'] = pokemon_df['Attack'] + pokemon_df['Defense']
pokemon_df

In [None]:
# Create the same total using iloc (assumes Attack and Defense are the columns at positions 6 and 7)
pokemon_df['Total_Atk_Dfn'] = pokemon_df.iloc[:, 6:8].sum(axis=1)
pokemon_df

In [None]:
# Select based on one condition using loc

# Select all pokemon with type 1 = Grass
pokemon_df.loc[pokemon_df['Type 1'] == 'Grass']

In [None]:
# Select based on Multiple conditions using loc

# Select all pokemon with type 1 = Grass and Type 2 = Poison
pokemon_df.loc[(pokemon_df['Type 1'] == 'Grass') & (pokemon_df['Type 2'] == 'Poison')]

In [None]:
# Select all pokemon with type 1 = Grass and Type 2 = Poison and HP > 60
pokemon_df.loc[(pokemon_df['Type 1'] == 'Grass') & (pokemon_df['Type 2'] == 'Poison') & (pokemon_df['HP'] > 60)]
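
Conditions can be combined with "or" as well as "and". The following sketch runs on a small hand-typed sample instead of the CSV file (`demo_df` is an illustrative name) and shows `|` alongside `.isin()`, which expresses the same test against a list of values.

```python
import pandas as pd

demo_df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Squirtle', 'Oddish'],
    'Type 1': ['Grass', 'Fire', 'Water', 'Grass'],
    'HP': [45, 39, 44, 45],
})

# | means "or"; each condition still needs its own parentheses
fire_or_water = demo_df.loc[
    (demo_df['Type 1'] == 'Fire') | (demo_df['Type 1'] == 'Water')
]

# .isin() checks membership in a list of values
same_rows = demo_df.loc[demo_df['Type 1'].isin(['Fire', 'Water'])]

print(fire_or_water['Name'].tolist())   # ['Charmander', 'Squirtle']
print(fire_or_water.equals(same_rows))  # True
```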

In [None]:
# the .str.contains() method can be used to select rows whose text contains a specific string

# Select all Mega pokemon
pokemon_df.loc[pokemon_df['Name'].str.contains('Mega')].head()

In [None]:
# the ~ can be used to select the opposite

# select all non-Mega pokemon
pokemon_df.loc[~pokemon_df['Name'].str.contains('Mega')].head()
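
Two details of `.str.contains()` are worth knowing: it matches substrings anywhere in the text, and it accepts `case` and `na` parameters. A minimal sketch, assuming a made-up Series of names that includes one missing value:

```python
import pandas as pd

names = pd.Series(['VenusaurMega Venusaur', 'Meganium', 'pikachu', None])

# na=False treats missing names as "no match", so the mask is purely True/False
has_mega = names.str.contains('Mega', na=False)

# case=False ignores letter case
has_pika = names.str.contains('PIKA', case=False, na=False)

print(has_mega.tolist())  # [True, True, False, False]
print(has_pika.tolist())  # [False, False, True, False]
```

Note that `'Meganium'` matches too, because the test is a plain substring search.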

## Visualizing Data

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Create a diagonal line in a graph

xpoints = np.array([0, 6])
ypoints = np.array([0, 250])

# Use the matplotlib plot function to draw points on a diagram
# The first parameter contains the x axis points, while the second contains y axis points
plt.plot(xpoints, ypoints)
plt.show()

In [None]:
# Using different markers
# markers can be specified in the plt function parameters using string notation

xpoints = np.array([1, 8])
ypoints = np.array([3, 10])

# Plot as circle markers
plt.plot(xpoints, ypoints, 'o')
plt.show()

In [None]:
# Plot multiple points
# You can use the marker keyword to emphasize certain points
xpoints = np.array([1, 2, 6, 8])
ypoints = np.array([3, 8, 1, 10])

plt.plot(xpoints, ypoints, marker='*')
plt.show()

In [None]:
# Use string notation shortcut to specify your line and points
# marker|line|color

plt.plot(xpoints, ypoints, 'x:r')
plt.show()

In [None]:
# You can also use parameter linestyle or ls, to specify the line

plt.plot(xpoints, ypoints, linestyle='dotted')
plt.show()

In [None]:
# You can also plot multiple lines by specifying the x and y points for each line

x1 = np.array([0, 1, 2, 3])
y1 = np.array([3, 8, 1, 10])
x2 = np.array([0, 1, 2, 3])
y2 = np.array([6, 2, 7, 11])

plt.plot(x1, y1)
plt.plot(x2, y2)

plt.show()

In [None]:
# You can add labels to your plot using plt.xlabel, plt.ylabel, and plt.title

plt.plot(y1)
plt.plot(y2)
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.title('Title')
plt.show()
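
When a chart has several lines, a legend helps tell them apart. As a sketch under the same made-up data, the example below passes `label=` to each plot call and then calls `legend()`; it uses the `fig, ax` style from `plt.subplots()`, which is an alternative to the `plt.` calls used in this notebook, and the label texts are placeholders.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0, 1, 2, 3])

fig, ax = plt.subplots()
ax.plot(x, np.array([3, 8, 1, 10]), label='first line')
ax.plot(x, np.array([6, 2, 7, 11]), label='second line')
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_title('Title')
ax.legend()  # builds the legend from the label= text of each line

plt.show()
```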

In [None]:
# You can add a grid to your chart using plt.grid()

plt.plot(y1)
plt.plot(y2)
plt.grid()

plt.show()

In [None]:
# You can create a scatter plot using plt.scatter()

x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])

plt.scatter(x, y)
plt.show()

In [None]:
# You can create a bar graph using plt.bar()

x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])

plt.bar(x, y)
plt.show()