<a href="https://colab.research.google.com/github/u5638928/u5638928-DataScience-GenAI-Submissions/blob/main/Copy_of_2_01_data_and_feature_engineering_in_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](https://drive.google.com/uc?export=view&id=1xqQczl0FG-qtNA2_WQYuWePW9oU8irqJ)

# Data and Feature Engineering
This Notebook will introduce us to preparing our datasets for data science and AI workloads. For structured data, this will often utilise popular Python libraries such as Pandas, which will be the focus for this Notebook.

### What is Pandas?
As Python has grown to become a popular solution for data analysis, many new tools have been introduced to support such tasks. Pandas is arguably the most popular and widely used of these, although it does have some limitations (if working with very large datasets you may want to look at Dask and/or PySpark).

First, we will import it into our session. This effectively means we have opened the library in the background, and can now use the [functions](https://github.com/MJMortensonWarwick/IB2AD0_Data_Science_GenerativeAI/blob/main/1_10_functions.ipynb) built into it. We will also import numpy - another widely used Python package for working with numerical data (the name is a portmanteau of "numerical" and "python"). By convention we import pandas as "pd" and numpy as "np". This is equivalent to giving it a variable name.

In [None]:
import pandas as pd
import numpy as np

### Testing our installation
To test everything is working we will create some fake data using numpy and load it into a pandas dataframe (more on these below). First we will create the random numbers:

In [None]:
x = np.random.rand(10,1)
x

array([[0.02775904],
       [0.63470896],
       [0.800166  ],
       [0.24403324],
       [0.63447658],
       [0.54835148],
       [0.41321498],
       [0.04518785],
       [0.6375421 ],
       [0.46120638]])

The commands here have told numpy to create a set of random numbers between zero and one. The arguments we have passed, "10" and "1", tell numpy we want a 10x1 array of numbers (i.e. a table, or more accurately a column vector, with 10 rows and 1 column).
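As a quick check of the shape and value range, here is a minimal sketch. It uses a seeded generator so that the numbers are repeatable; the seed value and the name `x_demo` are illustrative choices and not part of the cell above.

```python
import numpy as np

# Seeded generator so the same numbers appear on every run (the seed 42 is arbitrary)
rng = np.random.default_rng(42)

# Same idea as np.random.rand(10, 1): 10 rows, 1 column, values in [0, 1)
x_demo = rng.random((10, 1))

print(x_demo.shape)                 # rows first, then columns
print(x_demo.min(), x_demo.max())   # both values fall between zero and one
```

The first number in the shape is the row count and the second is the column count, which is why the arguments are given as 10 and then 1.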

Next we will create a pandas dataframe using "x":

In [None]:
testdf = pd.DataFrame(x)
testdf

Unnamed: 0,0
0,0.027759
1,0.634709
2,0.800166
3,0.244033
4,0.634477
5,0.548351
6,0.413215
7,0.045188
8,0.637542
9,0.461206


We have now successfully created a pandas dataframe!
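The column in the output is labelled `0` because we did not supply a name. A small sketch of naming it when the dataframe is created, where `random_value` and `named_df` are purely illustrative names:

```python
import numpy as np
import pandas as pd

x = np.random.rand(10, 1)

# The columns argument takes one label per column in the array
named_df = pd.DataFrame(x, columns=['random_value'])

print(named_df.head())       # first five rows
print(named_df.columns)      # the label we supplied
```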

### What are Dataframes?
Pandas has a very elegant way of managing data, very much borrowed from the statistical language R, mostly based around dataframes. You can think of dataframes a bit like an Excel table with rows, columns and common operations like sum, average and so on. We can use them on top of our previous work on lists and dictionaries (etc.) to make that data more usable and malleable.

To begin with we will create a dataset - in this case a [dictionary](https://github.com/MJMortensonWarwick/IB2AD0_Data_Science_GenerativeAI/blob/main/1_05_dictionaries.ipynb).



In [None]:
orders = {'o10001':{'date':'2024/01/10', 'product':'Hoodie', 'quantity':'1'},
            'o10002':{'date':'2024/01/13', 'product':'Tote bag', 'quantity':'2'},
            'o10003':{'date':'2024/01/14', 'product':'Pencil', 'quantity':'10'},
            'o10004':{'date':'2024/01/15', 'product':'T-shirt', 'quantity':'2'}
}
orders

{'o10001': {'date': '2024/01/10', 'product': 'Hoodie', 'quantity': '1'},
 'o10002': {'date': '2024/01/13', 'product': 'Tote bag', 'quantity': '2'},
 'o10003': {'date': '2024/01/14', 'product': 'Pencil', 'quantity': '10'},
 'o10004': {'date': '2024/01/15', 'product': 'T-shirt', 'quantity': '2'}}

We can convert this to a dataframe with great ease. Note that each outer key (the order ID) becomes a column and each inner key becomes a row:

In [None]:
import pandas as pd
import numpy as np

orders_df = pd.DataFrame(orders)
orders_df

Unnamed: 0,o10001,o10002,o10003,o10004
date,2024/01/10,2024/01/13,2024/01/14,2024/01/15
product,Hoodie,Tote bag,Pencil,T-shirt
quantity,1,2,10,2
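If we would rather have one row per order, one option is `pd.DataFrame.from_dict` with `orient='index'`. The sketch below also converts the quantities from strings to integers so they can be summed; the name `orders_by_row` is an illustrative choice.

```python
import pandas as pd

orders = {'o10001': {'date': '2024/01/10', 'product': 'Hoodie', 'quantity': '1'},
          'o10002': {'date': '2024/01/13', 'product': 'Tote bag', 'quantity': '2'},
          'o10003': {'date': '2024/01/14', 'product': 'Pencil', 'quantity': '10'},
          'o10004': {'date': '2024/01/15', 'product': 'T-shirt', 'quantity': '2'}}

# orient='index' treats each outer key (the order ID) as a row label
orders_by_row = pd.DataFrame.from_dict(orders, orient='index')

# The quantities were stored as strings, so convert them before doing arithmetic
orders_by_row['quantity'] = orders_by_row['quantity'].astype(int)

print(orders_by_row)
print(orders_by_row['quantity'].sum())  # 15
```

Transposing the original dataframe with `orders_df.T` gives the same layout, although the quantities would still need converting.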


We can even create such outputs from more complex dictionaries, such as a dictionary which includes a nested list:

In [None]:
customers = {'Mark':{'name':'Mark Johnson', 'open_orders':3, 'orders':['o10001', 'o10002', 'o10004']},
             'Katy':{'name':'Katy Hoad', 'open_orders':0, 'orders':[]},
             'Angela':{'name':'Angela Lorenz', 'open_orders':1, 'orders':['o10003']},
             'Bo':{'name':'Bo Kelestyn', 'open_orders':0, 'orders':[]}
}
customers

{'Mark': {'name': 'Mark Johnson',
  'open_orders': 3,
  'orders': ['o10001', 'o10002', 'o10004']},
 'Katy': {'name': 'Katy Hoad', 'open_orders': 0, 'orders': []},
 'Angela': {'name': 'Angela Lorenz', 'open_orders': 1, 'orders': ['o10003']},
 'Bo': {'name': 'Bo Kelestyn', 'open_orders': 0, 'orders': []}}

In [None]:
customers_df = pd.DataFrame(customers)
customers_df

Unnamed: 0,Mark,Katy,Angela,Bo
name,Mark Johnson,Katy Hoad,Angela Lorenz,Bo Kelestyn
open_orders,3,0,1,0
orders,"[o10001, o10002, o10004]",[],[o10003],[]
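Each cell in the `orders` row still holds an ordinary Python list. A minimal sketch of pulling one back out with `.loc` (row label first, then column label); `mark_orders` is an illustrative name:

```python
import pandas as pd

customers = {'Mark': {'name': 'Mark Johnson', 'open_orders': 3, 'orders': ['o10001', 'o10002', 'o10004']},
             'Katy': {'name': 'Katy Hoad', 'open_orders': 0, 'orders': []},
             'Angela': {'name': 'Angela Lorenz', 'open_orders': 1, 'orders': ['o10003']},
             'Bo': {'name': 'Bo Kelestyn', 'open_orders': 0, 'orders': []}}

customers_df = pd.DataFrame(customers)

# .loc[row_label, column_label] returns the single cell, here a list
mark_orders = customers_df.loc['orders', 'Mark']

print(mark_orders)
print(len(mark_orders))  # 3
```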


Pandas dataframes can also be built from various other data structures and sources (including Excel files, CSVs, text files, databases and many more). For example, we can make our dataframe from a set of [lists](https://github.com/MJMortensonWarwick/IB2AD0_Data_Science_GenerativeAI/blob/main/1_04_lists.ipynb):

In [None]:
a = [1, 2, 3, 4]
b = ["a", "b", "c", "d"]
c = [True, False, True, False]

listdf = pd.DataFrame([a, b, c])
listdf

Unnamed: 0,0,1,2,3
0,1,2,3,4
1,a,b,c,d
2,True,False,True,False
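In the output above each list became a row, so every column mixes numbers, strings and booleans. If each list should instead be a column, one option is to pass a dictionary that maps column names to lists. This is a sketch, with the column names and `columns_df` chosen for illustration:

```python
import pandas as pd

a = [1, 2, 3, 4]
b = ["a", "b", "c", "d"]
c = [True, False, True, False]

# Each dictionary key becomes a column name and each list becomes that column
columns_df = pd.DataFrame({'a': a, 'b': b, 'c': c})

print(columns_df)
print(columns_df.dtypes)  # each column now keeps a single data type
```

With this layout, column operations such as `columns_df['a'].sum()` work directly because the column holds only numbers.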


### EXERCISE
Try building your own dataframes from a list and/or dictionary you create. What would happen if an item were missing from one element? E.g. if "c" in the above example only had three items - True, False, True - rather than four. Test it - does the output match your expectation?

In [None]:
import pandas as pd

my_data_list = [
    ['Apple', 100, 1.50],
    ['Banana', 150, 0.75],
    ['Orange', 80, 2.00],
    ['Grapes', 200, 3.20]
]

# Create a DataFrame from the list
# We can also provide column names for better readability
df_from_list = pd.DataFrame(my_data_list, columns=['Fruit', 'Quantity', 'Price_per_unit'])
display(df_from_list)

Unnamed: 0,Fruit,Quantity,Price_per_unit
0,Apple,100,1.5
1,Banana,150,0.75
2,Orange,80,2.0
3,Grapes,200,3.2


This code creates a list of lists, where each inner list represents a row in the DataFrame. Then, `pd.DataFrame()` is used to convert this list into a DataFrame, with optional column names for clarity.


Now I will remove the price per unit element from 'Orange'. I predict this missing element will either raise an error or simply produce a missing (N/A) value.

In [None]:
import pandas as pd

my_data_list_missing = [
    ['Apple', 100, 1.50],
    ['Banana', 150, 0.75],
    ['Orange', 80], # Missing 'Price_per_unit'
    ['Grapes', 200, 3.20]
]

df_from_list_missing = pd.DataFrame(my_data_list_missing, columns=['Fruit', 'Quantity', 'Price_per_unit'])
display(df_from_list_missing)

Unnamed: 0,Fruit,Quantity,Price_per_unit
0,Apple,100,1.5
1,Banana,150,0.75
2,Orange,80,
3,Grapes,200,3.2


As you can see, because the 'Orange' entry had only two items while the DataFrame was expecting three columns, pandas filled the missing 'Price_per_unit' value with `NaN` (Not a Number). The same kind of padding is what to expect in the `a, b, c` example if `c` were shorter. Pandas handles inconsistencies in list lengths by padding with missing values (shown here as `NaN`) to maintain the rectangular structure of a DataFrame.

This validates my prediction. Instead of raising an error, pandas inserted a `NaN`, which stands for Not a Number, to keep the rectangular structure.
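To try the shorter "c" case from the exercise directly, here is a minimal sketch. The names `c_short` and `short_df` are illustrative, and `isna()` is used to locate the padded cell without depending on how the missing value is displayed.

```python
import pandas as pd

a = [1, 2, 3, 4]
b = ["a", "b", "c", "d"]
c_short = [True, False, True]  # one item fewer than a and b

# Each list becomes a row; the short row is padded to the width of the longest
short_df = pd.DataFrame([a, b, c_short])

# isna() returns True for every missing cell
print(short_df.isna())
print(short_df.isna().sum().sum())  # total number of missing cells
```

Only the last cell of the third row is flagged as missing, which matches the padding seen in the 'Orange' example.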