# Machine Learning Zoomcamp

## 1.9 Introduction to Pandas

Plan:

* Data Frames
* Series
* Index
* Accessing elements
* Element-wise operations
* Filtering
* String operations
* Summarizing operations
* Missing values
* Grouping
* Getting the NumPy arrays

## Understanding Pandas: A Simple Introduction

Pandas is a powerful and versatile data manipulation library in Python. It provides data structures and functions that are essential for data analysis and preprocessing tasks. With Pandas, you can easily load, manipulate, and analyze structured data. 

One of the key data structures in Pandas is the **DataFrame**. It is a two-dimensional table-like structure that allows you to store and manipulate data in a row-column format. You can think of it as a spreadsheet or a SQL table. Different columns can hold different data types (for example, numbers in one column and strings in another), making it suitable for a wide range of applications.

In addition to the DataFrame, Pandas also provides **Series**, which is a one-dimensional labeled array. It is similar to a column in a DataFrame and can be used to store and manipulate a single variable. Series are particularly useful when you need to perform operations on a specific column or extract a subset of data from a DataFrame. 

Pandas also offers a rich set of functions for data manipulation. You can perform operations such as filtering, sorting, transforming, and aggregating data with ease. Additionally, Pandas integrates well with other Python libraries such as NumPy and Matplotlib, allowing you to seamlessly combine data manipulation, numerical computation, and data visualization tasks.


### Importing Pandas

Before diving into Pandas's capabilities, we need to import it. Conventionally, we import Pandas with the alias `pd`, making it easier to reference its functions:

In [1]:
import numpy as np
import pandas as pd

## DataFrames

The main data structure we use in pandas is the `dataframe`, which is essentially a table. Let's create a simple one.

We will use a dataset that was prepared specifically for this session. It is a subset of a dataset that we will use for the next session where we'll be predicting the price of a car.

In [2]:
# Data is in the format of a list of lists
data = [
    ['Nissan', 'Stanza', 1991, 138, 4, 'MANUAL', 'sedan', 2000],
    ['Hyundai', 'Sonata', 2017, None, 4, 'AUTOMATIC', 'Sedan', 27150],
    ['Lotus', 'Elise', 2010, 218, 4, 'MANUAL', 'convertible', 54990],
    ['GMC', 'Acadia',  2017, 194, 4, 'AUTOMATIC', '4dr SUV', 34450],
    ['Nissan', 'Frontier', 2017, 261, 6, 'MANUAL', 'Pickup', 32340],
]

# A separate variable that defines the columns
columns = [
    'Make', 'Model', 'Year', 'Engine HP', 'Engine Cylinders',
    'Transmission Type', 'Vehicle_Style', 'MSRP'
]

In [3]:
data

[['Nissan', 'Stanza', 1991, 138, 4, 'MANUAL', 'sedan', 2000],
 ['Hyundai', 'Sonata', 2017, None, 4, 'AUTOMATIC', 'Sedan', 27150],
 ['Lotus', 'Elise', 2010, 218, 4, 'MANUAL', 'convertible', 54990],
 ['GMC', 'Acadia', 2017, 194, 4, 'AUTOMATIC', '4dr SUV', 34450],
 ['Nissan', 'Frontier', 2017, 261, 6, 'MANUAL', 'Pickup', 32340]]

In [4]:
columns

['Make',
 'Model',
 'Year',
 'Engine HP',
 'Engine Cylinders',
 'Transmission Type',
 'Vehicle_Style',
 'MSRP']

In [5]:
# Turn the above formatted data into a DataFrame
df = pd.DataFrame(data, columns=columns)

In [6]:
df

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |
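
The `Engine HP` values are displayed as `138.0` rather than `138` because the missing value changes how the column is stored. A minimal sketch of that behaviour, using a small made-up frame (`sample`) rather than the full dataset:

```python
import pandas as pd

# Hypothetical mini-frame: one numeric column with a missing value, one without.
sample = pd.DataFrame(
    [[138, 4], [None, 4], [218, 4]],
    columns=['Engine HP', 'Engine Cylinders'],
)

# None in a numeric column is stored as NaN, which makes the column float64.
print(sample.dtypes)

hp_dtype = sample['Engine HP'].dtype
# The column without missing values stays an integer column.
cyl_dtype = sample['Engine Cylinders'].dtype
```

Checking `df.dtypes` right after creating a dataframe is a quick way to spot columns that were converted like this.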


Alternatively, we can use a list of dictionaries to create a `dataframe`:

In [7]:
data = [
    {
        "Make": "Nissan",
        "Model": "Stanza",
        "Year": 1991,
        "Engine HP": 138.0,
        "Engine Cylinders": 4,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "sedan",
        "MSRP": 2000
    },
    {
        "Make": "Hyundai",
        "Model": "Sonata",
        "Year": 2017,
        "Engine HP": None,
        "Engine Cylinders": 4,
        "Transmission Type": "AUTOMATIC",
        "Vehicle_Style": "Sedan",
        "MSRP": 27150
    },
    {
        "Make": "Lotus",
        "Model": "Elise",
        "Year": 2010,
        "Engine HP": 218.0,
        "Engine Cylinders": 4,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "convertible",
        "MSRP": 54990
    },
    {
        "Make": "GMC",
        "Model": "Acadia",
        "Year": 2017,
        "Engine HP": 194.0,
        "Engine Cylinders": 4,
        "Transmission Type": "AUTOMATIC",
        "Vehicle_Style": "4dr SUV",
        "MSRP": 34450
    },
    {
        "Make": "Nissan",
        "Model": "Frontier",
        "Year": 2017,
        "Engine HP": 261.0,
        "Engine Cylinders": 6,
        "Transmission Type": "MANUAL",
        "Vehicle_Style": "Pickup",
        "MSRP": 32340
    }
]

In [8]:
# Here, we don't provide the column names.
# pandas infers them from the keys of the dictionaries.

df = pd.DataFrame(data)
df

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |


The first thing to do after loading a `dataframe` from a CSV file or a SQL query is to look at the data with the `df.head(n=num_of_rows)` method.

In [9]:
# When working with a larger dataframe, retrieve only the first few rows
# for inspection with the df.head(n=num_of_rows) method.

# Return the first two rows.
df.head(n=2)

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
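
Since `head` is usually the first call after loading data from a CSV file, here is a minimal sketch of that workflow. The CSV content (`csv_text`) is made up for illustration and stands in for a real file on disk:

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path.
csv_text = """Make,Model,Year,MSRP
Nissan,Stanza,1991,2000
Hyundai,Sonata,2017,27150
Lotus,Elise,2010,54990
"""

# pd.read_csv accepts a path or any file-like object.
cars = pd.read_csv(io.StringIO(csv_text))

# Inspect the first rows and the overall size before doing anything else.
first_rows = cars.head(n=2)
print(first_rows)
print(cars.shape)
```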


## Series

Every column of this dataframe is a series: in pandas, a dataframe is a table made up of multiple series. To access a particular series in a dataframe, e.g., the "Make" column, we can use `df.Make`.

In [10]:
# Return the 'Make' column of the dataframe as a series
df.Make

0     Nissan
1    Hyundai
2      Lotus
3        GMC
4     Nissan
Name: Make, dtype: object

In [11]:
# Dot notation does not work when the column name contains whitespace.
df.Engine HP

SyntaxError: invalid syntax (1897567212.py, line 1)

Another way to extract a particular series from a dataframe is to use bracket notation instead of dot notation. This is particularly useful when the column name contains whitespace.

In [12]:
# Specify the name of the column we want to extract within the brackets.
df['Engine HP']

0    138.0
1      NaN
2    218.0
3    194.0
4    261.0
Name: Engine HP, dtype: float64

In [13]:
# Access multiple columns at the same time.

# Get a subset of our dataframe that contains only
# 'Make', 'Model', and 'MSRP' by putting a list inside the brackets.
df[['Make', 'Model', 'MSRP']]

| | Make | Model | MSRP |
|---|---|---|---|
| 0 | Nissan | Stanza | 2000 |
| 1 | Hyundai | Sonata | 27150 |
| 2 | Lotus | Elise | 54990 |
| 3 | GMC | Acadia | 34450 |
| 4 | Nissan | Frontier | 32340 |


In [14]:
# Add another column to this dataframe.

# Add a new column called 'id'.
df['id'] = [1, 2, 3, 4, 5]
df

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 | 1 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 | 2 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 | 3 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 | 4 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 | 5 |


In [15]:
# Replace the column with a different set of values. 
df['id'] = [10, 20, 30, 40, 50]

In [16]:
df

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP | id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 | 10 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 | 20 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 | 30 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 | 40 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 | 50 |


In [17]:
# Delete an existing column using the del operator.
# The syntax is similar to removing something from a dictionary.
del df['id']

In [18]:
df

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |
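
`del` modifies the dataframe in place. As an alternative sketch, `DataFrame.drop` returns a new dataframe and leaves the original untouched. The small frame below (`cars`) is made up for illustration:

```python
import pandas as pd

# Hypothetical frame with a helper column we want to remove.
cars = pd.DataFrame({
    'Make': ['Nissan', 'Hyundai'],
    'MSRP': [2000, 27150],
    'id': [10, 20],
})

# drop() returns a copy without the listed columns; `cars` itself is unchanged.
without_id = cars.drop(columns=['id'])

print(list(without_id.columns))
print(list(cars.columns))
```

Which form to use is mostly a matter of whether the original dataframe is still needed afterwards.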


## Index

In a dataframe, the leftmost column holds the ids (indices) of the rows, and this is how we refer to each row.

Using this index, we can access the elements of the dataframe.

In [19]:
df.index

RangeIndex(start=0, stop=5, step=1)

In [20]:
df.Make

0     Nissan
1    Hyundai
2      Lotus
3        GMC
4     Nissan
Name: Make, dtype: object

In [21]:
# The index we see here is the same index that the dataframe has.
df.Make.index

RangeIndex(start=0, stop=5, step=1)

In [22]:
# Use df.loc[idx] to access the row of the element that is indexed by 1.
# 'loc' stands for location.
df.loc[1]

Make                   Hyundai
Model                   Sonata
Year                      2017
Engine HP                  NaN
Engine Cylinders             4
Transmission Type    AUTOMATIC
Vehicle_Style            Sedan
MSRP                     27150
Name: 1, dtype: object

In [23]:
# We can also return multiple rows using df.loc[[idx1, idx2]]
df.loc[[1, 2]]

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |


In [24]:
# We can replace the default numerical index with something else, such as letters.
# E.g., 'a', 'b', 'c', 'd', 'e'.
df.index = ['a', 'b', 'c', 'd', 'e']

In [25]:
df

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| a | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| b | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| c | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| d | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| e | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |


In [26]:
# However, df.loc[[1, 2]] no longer works because there
# are no rows with these labels in the index.
df.loc[[1, 2]]

KeyError: "None of [Index([1, 2], dtype='int64')] are in the [index]"

In [27]:
# We need to use 'b' and 'c' to refer to these particular records in 
# the dataframe. 
df.loc[['b', 'c']]

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| b | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| c | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |


In [28]:
# However, we can still refer to rows by position
# with df.iloc[[pos1, pos2]], regardless of the index labels.
df.iloc[[1, 2, 4]]

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| b | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| c | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| e | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |
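
The difference between label-based and positional access also shows up when slicing. A minimal sketch on a made-up three-row frame: `loc` slices include the end label, while `iloc` slices exclude the end position, like regular Python slicing.

```python
import pandas as pd

# Hypothetical frame with a string index, mirroring the 'a', 'b', 'c' labels above.
cars = pd.DataFrame(
    {'Make': ['Nissan', 'Hyundai', 'Lotus'], 'MSRP': [2000, 27150, 54990]},
    index=['a', 'b', 'c'],
)

# Label-based slice: both endpoints are included.
by_label = cars.loc['a':'b']

# Position-based slice: the end position is excluded.
by_position = cars.iloc[0:1]

print(by_label)
print(by_position)
```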


In [29]:
# Reset the modified index to default and create a new column 
# called index to keep the values from the old index. 
df.reset_index()

| | index | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|---|
| 0 | a | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | b | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | c | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | d | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | e | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |


In [30]:
# If we don't need the values of the old index, we add this parameter 
# called 'drop=True'. We also rewrite the dataframe with the new data by 
# reassigning df to the result after reset. 
df = df.reset_index(drop=True)

In [31]:
df

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |


## Accessing elements

Refer to the previous sections covering `df.column_name`, `df['column_name']`, `df.loc[]` and `df.iloc[]`.
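
These accessors can also be combined to reach a single cell or a block of rows and columns. A minimal sketch on a made-up two-row frame (`cars`), selecting one value by label and by position:

```python
import pandas as pd

# Hypothetical two-row frame for illustration.
cars = pd.DataFrame({
    'Make': ['Nissan', 'Hyundai'],
    'Engine HP': [138.0, None],
    'MSRP': [2000, 27150],
})

# Row label and column name together select a single cell.
price = cars.loc[1, 'MSRP']

# The positional equivalent: row position 1, column position 2.
same_price = cars.iloc[1, 2]

# Selecting several rows and a subset of columns in one step.
subset = cars.loc[[0, 1], ['Make', 'MSRP']]

print(price, same_price)
print(subset)
```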

## Element-wise operations

Almost everything we can do with NumPy arrays also works in pandas, except that we operate on pandas series instead of NumPy arrays.

The main difference between the two is that a series also has an index and a name. Otherwise, they behave very similarly.

Under the hood, pandas actually uses NumPy.
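
The index is more than a display detail: arithmetic between two series matches values by index label, whereas NumPy arrays are combined by position. A small sketch with made-up numbers (`hp` and `bonus` are hypothetical series):

```python
import pandas as pd

# Two hypothetical series with the same labels in a different order.
hp = pd.Series([138.0, 218.0], index=['a', 'b'], name='Engine HP')
bonus = pd.Series([10.0, 20.0], index=['b', 'a'])

# pandas aligns on the labels: 'a' gets 138 + 20, 'b' gets 218 + 10.
aligned = hp + bonus

# The underlying NumPy arrays are added position by position instead.
positional = hp.values + bonus.values

print(aligned)
print(positional)
```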

In [32]:
df['Engine HP']

0    138.0
1      NaN
2    218.0
3    194.0
4    261.0
Name: Engine HP, dtype: float64

In [33]:
# Similar to numpy array, we can apply element-wise operations 
# to each element of the column. 
df['Engine HP'] / 100

0    1.38
1     NaN
2    2.18
3    1.94
4    2.61
Name: Engine HP, dtype: float64

In [34]:
# Multiply each element of the column by 2.
df['Engine HP'] * 2

0    276.0
1      NaN
2    436.0
3    388.0
4    522.0
Name: Engine HP, dtype: float64

In [35]:
# Pandas also supports comparison operators.

# Flag the cars that were manufactured in 2015 or later.
df['Year'] >= 2015

0    False
1     True
2    False
3     True
4     True
Name: Year, dtype: bool

## Filtering

In [36]:
# Apply filtering to the dataframe with the previous condition.
# df[condition]

# Look at all the records of cars that were manufactured in year 2015 and after.
df[df['Year'] >= 2015]

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | Sedan | 27150 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr SUV | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |


In [37]:
# Find all cars that are made by Nissan.
df[
    df['Make'] == 'Nissan'
]

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |


In [38]:
# Combine conditions with the element-wise operator '&' and wrap each condition in parentheses.
df[
    (df['Make'] == 'Nissan') & (df['Year'] >= 2015)
]

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | Pickup | 32340 |
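
Besides `&`, conditions can be combined with `|` (or) and negated with `~`. Python's own `and`, `or`, and `not` keywords do not work element-wise on a series. A short sketch using only the `Make` and `Year` columns of the dataset (the variable names are illustrative):

```python
import pandas as pd

cars = pd.DataFrame({
    'Make': ['Nissan', 'Hyundai', 'Lotus', 'GMC', 'Nissan'],
    'Year': [1991, 2017, 2010, 2017, 2017],
})

# '|' is element-wise OR: Nissan cars, or any car from 2015 onwards.
nissan_or_recent = cars[(cars['Make'] == 'Nissan') | (cars['Year'] >= 2015)]

# '~' negates a condition: everything that is not a Nissan.
not_nissan = cars[~(cars['Make'] == 'Nissan')]

# isin() tests membership in a list of values.
selected_makes = cars[cars['Make'].isin(['Lotus', 'GMC'])]

print(nissan_or_recent)
print(not_nissan)
print(selected_makes)
```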


## String operations

NumPy is mostly used for processing numbers, so string handling is not its focus. In pandas, however, we will often need to handle strings.

Here, we'll cover a couple of useful things for manipulating strings.

In [39]:
# Would be nice to standardize the string format 
# (i.e., all uppercase, all lowercase, all starts with capital)
df['Vehicle_Style']

0          sedan
1          Sedan
2    convertible
3        4dr SUV
4         Pickup
Name: Vehicle_Style, dtype: object

In [40]:
# Use df['column_name'].str to invoke str methods on the entire series.

# Convert every string in the series to lowercase.
df['Vehicle_Style'].str.lower()

0          sedan
1          sedan
2    convertible
3        4dr suv
4         pickup
Name: Vehicle_Style, dtype: object

In [41]:
# Replace all spaces with underscores, first on a plain Python string.
'machine learning zoomcamp'.replace(' ', '_')

'machine_learning_zoomcamp'

In [42]:
df['Vehicle_Style'].str.replace(' ', '_')

0          sedan
1          Sedan
2    convertible
3        4dr_SUV
4         Pickup
Name: Vehicle_Style, dtype: object

In [43]:
# Chain the string operations.
df['Vehicle_Style'].str.replace(' ', '_').str.lower()

0          sedan
1          sedan
2    convertible
3        4dr_suv
4         pickup
Name: Vehicle_Style, dtype: object

In [44]:
# Replace/overwrite the series with the clean version by reassigning it back 
# to the series. 
df['Vehicle_Style'] = df['Vehicle_Style'].str.replace(' ', '_').str.lower()

In [45]:
df

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr_suv | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | pickup | 32340 |
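
The same `.str` methods are available on the column names, which in this dataset mix upper case, spaces, and underscores. A minimal sketch, assuming we want lowercase names with underscores (the empty `cars` frame exists only to carry the column names):

```python
import pandas as pd

# Empty frame that only carries some of the dataset's column names.
cars = pd.DataFrame(columns=['Make', 'Engine HP', 'Transmission Type', 'Vehicle_Style'])

# The column index supports the same .str accessor as a series.
cars.columns = cars.columns.str.lower().str.replace(' ', '_')

print(list(cars.columns))
```

With names in this form, every column can also be reached with dot notation, e.g. `cars.engine_hp`.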


## Summarizing operations

Like NumPy, pandas also supports summarizing operations.

In [46]:
# Minimum value
df.MSRP.min()
df['MSRP'].min()

np.int64(2000)

In [47]:
# Maximum value
df.MSRP.max()
df['MSRP'].max()

np.int64(54990)

In [48]:
# Mean
df.MSRP.mean()
df['MSRP'].mean()

np.float64(30186.0)

In [49]:
# describe() function reports all the useful statistics like 
# count, mean, std, min, max, percentiles. 
df.MSRP.describe()
df['MSRP'].describe()

count        5.000000
mean     30186.000000
std      18985.044904
min       2000.000000
25%      27150.000000
50%      32340.000000
75%      34450.000000
max      54990.000000
Name: MSRP, dtype: float64

In [50]:
# Report useful summary statistics on all numerical columns of the dataframe. 
df.describe()

| | Year | Engine HP | Engine Cylinders | MSRP |
|---|---|---|---|---|
| count | 5.0 | 4.0 | 5.0 | 5.0 |
| mean | 2010.4 | 202.75 | 4.4 | 30186.0 |
| std | 11.260551 | 51.29896 | 0.894427 | 18985.044904 |
| min | 1991.0 | 138.0 | 4.0 | 2000.0 |
| 25% | 2010.0 | 180.0 | 4.0 | 27150.0 |
| 50% | 2017.0 | 206.0 | 4.0 | 32340.0 |
| 75% | 2017.0 | 228.75 | 4.0 | 34450.0 |
| max | 2017.0 | 261.0 | 6.0 | 54990.0 |


In [51]:
# Use the round() method to round the values to two decimal places
# to make the output more compact.
df.describe().round(2)

| | Year | Engine HP | Engine Cylinders | MSRP |
|---|---|---|---|---|
| count | 5.0 | 4.0 | 5.0 | 5.0 |
| mean | 2010.4 | 202.75 | 4.4 | 30186.0 |
| std | 11.26 | 51.3 | 0.89 | 18985.04 |
| min | 1991.0 | 138.0 | 4.0 | 2000.0 |
| 25% | 2010.0 | 180.0 | 4.0 | 27150.0 |
| 50% | 2017.0 | 206.0 | 4.0 | 32340.0 |
| 75% | 2017.0 | 228.75 | 4.0 | 34450.0 |
| max | 2017.0 | 261.0 | 6.0 | 54990.0 |


In [52]:
# For categorical variables (e.g., strings), we can use the
# nunique() method to see how many unique values there are.
df.nunique()

Make                 4
Model                5
Year                 3
Engine HP            4
Engine Cylinders     2
Transmission Type    2
Vehicle_Style        4
MSRP                 5
dtype: int64

In [53]:
df.Make.nunique()

4
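
`nunique()` only reports how many distinct values there are. To see which values they are and how often each occurs, `unique()` and `value_counts()` can be used. A short sketch on the `Make` values from the dataset (the variable names are illustrative):

```python
import pandas as pd

makes = pd.Series(['Nissan', 'Hyundai', 'Lotus', 'GMC', 'Nissan'], name='Make')

# unique() returns the distinct values in order of first appearance.
distinct_makes = makes.unique()

# value_counts() counts how often each value occurs, most frequent first.
counts = makes.value_counts()

print(distinct_makes)
print(counts)
```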

## Missing values

When it comes to machine learning, we usually don't want missing values (`NaN`) in our data.

We want to know how many values are `NaN` and then do something about them. For that, we can use the `df.isnull()` method, which returns a new dataframe where each cell is True if the value is `NaN` and False otherwise.

In [54]:
df.isnull()

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False |
| 1 | False | False | False | True | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False |


In [55]:
# Count the missing values in each column.
# sum() adds up each column separately, and True counts as 1.
df.isnull().sum()

Make                 0
Model                0
Year                 0
Engine HP            1
Engine Cylinders     0
Transmission Type    0
Vehicle_Style        0
MSRP                 0
dtype: int64
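
Once the missing values are counted, a common next step is to fill them or drop the affected rows. The sketch below shows three possible options on a made-up frame. Which one is appropriate depends on the data, and the fill value `0` is an arbitrary choice for illustration:

```python
import pandas as pd

# Hypothetical frame with one missing horsepower value.
cars = pd.DataFrame({
    'Make': ['Nissan', 'Hyundai', 'Lotus'],
    'Engine HP': [138.0, None, 218.0],
})

# Option 1: replace missing values with a constant.
filled = cars['Engine HP'].fillna(0)

# Option 2: replace them with the column mean (mean() skips NaN by default).
filled_mean = cars['Engine HP'].fillna(cars['Engine HP'].mean())

# Option 3: drop every row that contains at least one missing value.
complete_rows = cars.dropna()

print(filled.tolist())
print(filled_mean.tolist())
print(complete_rows)
```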

## Grouping


We want to see the mean price for each transmission type ('MANUAL' and 'AUTOMATIC').

SQL query example to compute the average price per transmission type:
```sql
SELECT 
    transmission_type,
    AVG(MSRP)
FROM
    cars
GROUP BY
    transmission_type
```

Let's translate the SQL query to pandas.

In pandas, we have a method called `df.groupby('column_name')`. In the method argument, we specify the column to group by. 

In [56]:
# Compute the mean price per transmission type.
df.groupby('Transmission Type').MSRP.mean()

Transmission Type
AUTOMATIC    30800.000000
MANUAL       29776.666667
Name: MSRP, dtype: float64

In [57]:
# Round the result to 2 decimal places.
df.groupby('Transmission Type').MSRP.mean().round(2)

Transmission Type
AUTOMATIC    30800.00
MANUAL       29776.67
Name: MSRP, dtype: float64

In [58]:
# Compute the min price per transmission type.
df.groupby('Transmission Type').MSRP.min()

Transmission Type
AUTOMATIC    27150
MANUAL        2000
Name: MSRP, dtype: int64

In [59]:
# Compute the max price per transmission type.
df.groupby('Transmission Type').MSRP.max()

Transmission Type
AUTOMATIC    34450
MANUAL       54990
Name: MSRP, dtype: int64
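
Instead of running one `groupby` per statistic, `agg` can compute several at once. A sketch using the `Transmission Type` and `MSRP` columns of the dataset (the name `summary` is illustrative):

```python
import pandas as pd

cars = pd.DataFrame({
    'Transmission Type': ['MANUAL', 'AUTOMATIC', 'MANUAL', 'AUTOMATIC', 'MANUAL'],
    'MSRP': [2000, 27150, 54990, 34450, 32340],
})

# agg() computes several statistics per group in a single call.
summary = (
    cars.groupby('Transmission Type')['MSRP']
    .agg(['mean', 'min', 'max'])
    .round(2)
)

print(summary)
```

The result is a dataframe with one row per transmission type and one column per statistic.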

## Getting the NumPy arrays

By default, pandas data is backed by NumPy arrays.

In [60]:
# Return a series in pandas
df.MSRP

0     2000
1    27150
2    54990
3    34450
4    32340
Name: MSRP, dtype: int64

Return the underlying numpy array by adding `.values` to the pandas series.

In [61]:
df.MSRP.values

array([ 2000, 27150, 54990, 34450, 32340])
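
The pandas documentation recommends the `to_numpy()` method over `.values`, and it works on whole dataframes as well. A minimal sketch on a made-up two-row frame:

```python
import pandas as pd

# Hypothetical two-row frame with one text column and one numeric column.
cars = pd.DataFrame({
    'Make': ['Nissan', 'Hyundai'],
    'MSRP': [2000, 27150],
})

# to_numpy() is the method form of .values for a single column.
prices = cars['MSRP'].to_numpy()

# A whole dataframe with mixed column types becomes a 2-D object array.
table = cars.to_numpy()

print(prices, prices.dtype)
print(table.shape, table.dtype)
```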

In [62]:
df

| | Make | Model | Year | Engine HP | Engine Cylinders | Transmission Type | Vehicle_Style | MSRP |
|---|---|---|---|---|---|---|---|---|
| 0 | Nissan | Stanza | 1991 | 138.0 | 4 | MANUAL | sedan | 2000 |
| 1 | Hyundai | Sonata | 2017 | NaN | 4 | AUTOMATIC | sedan | 27150 |
| 2 | Lotus | Elise | 2010 | 218.0 | 4 | MANUAL | convertible | 54990 |
| 3 | GMC | Acadia | 2017 | 194.0 | 4 | AUTOMATIC | 4dr_suv | 34450 |
| 4 | Nissan | Frontier | 2017 | 261.0 | 6 | MANUAL | pickup | 32340 |


In [63]:
# Convert a pandas dataframe back to a list of dictionaries format so that you can save it to a file.
# Specify that you want to do this per record. 

df.to_dict(orient='records')

[{'Make': 'Nissan',
  'Model': 'Stanza',
  'Year': 1991,
  'Engine HP': 138.0,
  'Engine Cylinders': 4,
  'Transmission Type': 'MANUAL',
  'Vehicle_Style': 'sedan',
  'MSRP': 2000},
 {'Make': 'Hyundai',
  'Model': 'Sonata',
  'Year': 2017,
  'Engine HP': nan,
  'Engine Cylinders': 4,
  'Transmission Type': 'AUTOMATIC',
  'Vehicle_Style': 'sedan',
  'MSRP': 27150},
 {'Make': 'Lotus',
  'Model': 'Elise',
  'Year': 2010,
  'Engine HP': 218.0,
  'Engine Cylinders': 4,
  'Transmission Type': 'MANUAL',
  'Vehicle_Style': 'convertible',
  'MSRP': 54990},
 {'Make': 'GMC',
  'Model': 'Acadia',
  'Year': 2017,
  'Engine HP': 194.0,
  'Engine Cylinders': 4,
  'Transmission Type': 'AUTOMATIC',
  'Vehicle_Style': '4dr_suv',
  'MSRP': 34450},
 {'Make': 'Nissan',
  'Model': 'Frontier',
  'Year': 2017,
  'Engine HP': 261.0,
  'Engine Cylinders': 6,
  'Transmission Type': 'MANUAL',
  'Vehicle_Style': 'pickup',
  'MSRP': 32340}]
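
One possible way to use the list of records is to serialize it as JSON and rebuild the dataframe later. The sketch below does the round trip in memory on a made-up two-row frame. Note that Python's `json` module writes the missing value as the non-standard token `NaN`, which other JSON parsers may reject.

```python
import json

import pandas as pd

# Hypothetical two-row frame with one missing value.
cars = pd.DataFrame({
    'Make': ['Nissan', 'Hyundai'],
    'Engine HP': [138.0, None],
    'MSRP': [2000, 27150],
})

records = cars.to_dict(orient='records')

# Serialize the records to a JSON string (this is what would go into a file).
payload = json.dumps(records)

# Parsing the string and passing the records to pd.DataFrame rebuilds the table.
restored = pd.DataFrame(json.loads(payload))

print(restored)
```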

## Conclusion

Whether you are working on a small data analysis project or dealing with large datasets, Pandas provides efficient and intuitive tools to handle your data. Its extensive [documentation](https://pandas.pydata.org/docs/) and active community support make it a popular choice among data scientists and analysts.  

### Further Readings
- [Machine Learning Bookcamp Pandas Notebook](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/appendix-d-pandas.ipynb)
- [Machine Learning Zoomcamp 2023 notes from Peter Ernicke](https://knowmledge.com/2023/09/16/ml-zoomcamp-2023-introduction-to-machine-learning-part-12/)
- [Pandas Cheatsheet](https://www.datacamp.com/cheat-sheet/pandas-cheat-sheet-for-data-science-in-python)