# **Pandas DataFrame Fundamentals**
This notebook provides a foundational understanding of DataFrames, a core data structure in the pandas library that is essential for data manipulation and analysis in Python.

### **What is a DataFrame?**
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). Think of it as:
1. A spreadsheet (like Excel)
1. A SQL table
1. A dictionary of Series objects

It's the most commonly used pandas object, and it allows you to store and manipulate data in a structured way.

### **Key Characteristics of DataFrames**
1. Labeled Axes: Both rows and columns have labels (indices for rows, column names for columns). This makes data access intuitive.
1. Heterogeneous Data: Columns can have different data types (e.g., one column can be integers, another strings, and another booleans).
1. Size Mutable: You can add or remove columns and rows.
1. Value Mutable: Data within the DataFrame can be changed.
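
The four characteristics above can be seen in a few lines. This is a minimal sketch on a made-up three-row frame; the name `demo`, its columns, and its row labels are illustrative assumptions, not part of the notebook's data.

```python
import pandas as pd

# Hypothetical frame used only to illustrate the characteristics above
demo = pd.DataFrame(
    {
        "name": ["Ann", "Ben", "Cy"],    # strings
        "age": [31, 45, 28],             # integers
        "active": [True, False, True],   # booleans
    },
    index=["a", "b", "c"],               # custom row labels
)

# Heterogeneous data: each column keeps its own dtype
print(demo.dtypes)

# Labeled axes: address a cell by row label and column name
print(demo.loc["b", "age"])

# Size mutable: add a column, then remove a row
demo["score"] = [0.5, 0.7, 0.9]
demo = demo.drop(index="c")

# Value mutable: overwrite a single cell
demo.loc["a", "age"] = 32
print(demo.shape)
```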

### **Creating DataFrames**
DataFrames can be created in various ways, most commonly from:
1. Dictionaries of lists or Series
1. Lists of dictionaries
1. NumPy arrays
1. CSV or other external files

Let's look at some examples of how to create DataFrames.
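
The sketch below builds small frames from each of the sources listed above. The variable names and the inline CSV text are illustrative assumptions; the values borrow the first two rows of the salary data used later in this notebook.

```python
import io

import numpy as np
import pandas as pd

# 1. From a dictionary of lists: keys become column names
df_from_dict = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30],
})

# 2. From a list of dictionaries: each dictionary becomes one row
df_from_records = pd.DataFrame([
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "Age": 30},
])

# 3. From a NumPy array: column names are supplied separately
df_from_array = pd.DataFrame(
    np.array([[25, 70000], [30, 85000]]),
    columns=["Age", "Salary"],
)

# 4. From CSV text: read_csv accepts a file path or a file-like object
csv_text = "Name,Age\nAlice,25\nBob,30\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_dict)
print(df_from_records)
print(df_from_array)
print(df_from_csv)
```

The first two constructions produce the same table; which one is more convenient usually depends on whether the data arrives column by column or record by record.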


### **Viewing Data**

In [None]:
# Import Pandas library
import pandas as pd

In [None]:
# Load the CSV file SalaryData.csv into the DataFrame df
df = pd.read_csv('SalaryData.csv')
# Print df
df

Unnamed: 0,Name,Age,City,Occupation,Salary
0,Alice,25,New York,Engineer,70000
1,Bob,30,Los Angeles,Artist,85000
2,Charlie,35,Chicago,Doctor,120000
3,David,28,New York,Engineer,75000
4,Eve,32,Houston,Scientist,90000
5,Frank,40,Miami,Manager,110000
6,Grace,22,Boston,Designer,60000


In [None]:
# Display the first five rows
df.head()

Unnamed: 0,Name,Age,City,Occupation,Salary
0,Alice,25,New York,Engineer,70000
1,Bob,30,Los Angeles,Artist,85000
2,Charlie,35,Chicago,Doctor,120000
3,David,28,New York,Engineer,75000
4,Eve,32,Houston,Scientist,90000


In [None]:
# Display the last five rows
df.tail()

Unnamed: 0,Name,Age,City,Occupation,Salary
2,Charlie,35,Chicago,Doctor,120000
3,David,28,New York,Engineer,75000
4,Eve,32,Houston,Scientist,90000
5,Frank,40,Miami,Manager,110000
6,Grace,22,Boston,Designer,60000


In [None]:
# Display concise summary of df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        7 non-null      object
 1   Age         7 non-null      int64 
 2   City        7 non-null      object
 3   Occupation  7 non-null      object
 4   Salary      7 non-null      int64 
dtypes: int64(2), object(3)
memory usage: 412.0+ bytes


In [None]:
# Generate descriptive statistics
df.describe()

Unnamed: 0,Age,Salary
count,7.0,7.0
mean,30.285714,87142.857143
std,6.074929,21574.89723
min,22.0,60000.0
25%,26.5,72500.0
50%,30.0,85000.0
75%,33.5,100000.0
max,40.0,120000.0


In [None]:
# Show the dimensions of df as (number of rows, number of columns)
df.shape

(7, 5)

### **Selecting Columns**

In [None]:
# Select one column: Name
df['Name']

Unnamed: 0,Name
0,Alice
1,Bob
2,Charlie
3,David
4,Eve
5,Frank
6,Grace


In [None]:
# Display multiple columns
df[['Name', 'Age']]

Unnamed: 0,Name,Age
0,Alice,25
1,Bob,30
2,Charlie,35
3,David,28
4,Eve,32
5,Frank,40
6,Grace,22
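
Single and double brackets return different object types, which matters when a later step expects one or the other. A minimal sketch, assuming a two-row sample (`people` is a hypothetical name) in place of the full file:

```python
import pandas as pd

# Hypothetical two-row sample mirroring the first rows of the salary data
people = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

single = people["Name"]    # single brackets with one name -> Series
table = people[["Name"]]   # list of names -> DataFrame with one column

print(type(single).__name__, single.shape)
print(type(table).__name__, table.shape)
```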


### **Adding and Modifying Columns**

In [None]:
# Add new column: Experience
df['Experience'] = [2, 5, 10, 3, 7, 15, 1]
df

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
0,Alice,25,New York,Engineer,70000,2
1,Bob,30,Los Angeles,Artist,85000,5
2,Charlie,35,Chicago,Doctor,120000,10
3,David,28,New York,Engineer,75000,3
4,Eve,32,Houston,Scientist,90000,7
5,Frank,40,Miami,Manager,110000,15
6,Grace,22,Boston,Designer,60000,1


In [None]:
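# Modify values of an existing column: increment Age by 1 year
# Note: re-running this cell increments Age again; the outputs below appear to reflect two runs (e.g. Alice 25 -> 27)
df['Age'] = df['Age'] + 1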
df

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
0,Alice,27,New York,Engineer,70000,2
1,Bob,32,Los Angeles,Artist,85000,5
2,Charlie,37,Chicago,Doctor,120000,10
3,David,30,New York,Engineer,75000,3
4,Eve,34,Houston,Scientist,90000,7
5,Frank,42,Miami,Manager,110000,15
6,Grace,24,Boston,Designer,60000,1


In [None]:
# Add new column based on a condition: Seniority is 'Senior' where Age >= 30, otherwise 'Junior'
df['Seniority'] = ['Senior' if x >= 30 else 'Junior' for x in df['Age']]
df

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience,Seniority
0,Alice,27,New York,Engineer,70000,2,Junior
1,Bob,32,Los Angeles,Artist,85000,5,Senior
2,Charlie,37,Chicago,Doctor,120000,10,Senior
3,David,30,New York,Engineer,75000,3,Senior
4,Eve,34,Houston,Scientist,90000,7,Senior
5,Frank,42,Miami,Manager,110000,15,Senior
6,Grace,24,Boston,Designer,60000,1,Junior
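
The list comprehension above works row by row. A vectorized alternative is `numpy.where`, sketched here on a hypothetical three-row sample (`ages` is an assumed name) rather than the full `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with ages taken from the table above
ages = pd.DataFrame({"Name": ["Alice", "David", "Grace"], "Age": [27, 30, 24]})

# np.where(condition, value_if_true, value_if_false) evaluates the whole column at once
ages["Seniority"] = np.where(ages["Age"] >= 30, "Senior", "Junior")
print(ages)
```

Both approaches give the same labels; the vectorized form tends to be faster on large frames and keeps the condition in one readable expression.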


### **Deleting Columns**

In [None]:
# Use drop() to delete a column
# drop() returns a new DataFrame (df_no_seniority); df itself still contains Seniority
df_no_seniority = df.drop('Seniority', axis=1)
df_no_seniority

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
0,Alice,27,New York,Engineer,70000,2
1,Bob,32,Los Angeles,Artist,85000,5
2,Charlie,37,Chicago,Doctor,120000,10
3,David,30,New York,Engineer,75000,3
4,Eve,34,Houston,Scientist,90000,7
5,Frank,42,Miami,Manager,110000,15
6,Grace,24,Boston,Designer,60000,1


In [None]:
# Delete a column in place using del (this modifies df directly)
del df['Seniority']
df

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
0,Alice,27,New York,Engineer,70000,2
1,Bob,32,Los Angeles,Artist,85000,5
2,Charlie,37,Chicago,Doctor,120000,10
3,David,30,New York,Engineer,75000,3
4,Eve,34,Houston,Scientist,90000,7
5,Frank,42,Miami,Manager,110000,15
6,Grace,24,Boston,Designer,60000,1
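
The two cells above differ in whether the original frame changes. The sketch below makes that explicit on a made-up frame (`frame` and its columns `A`, `B`, `C` are assumptions) and also shows `pop()`, which removes a column in place and returns it.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# drop() returns a new DataFrame and leaves the original unchanged
without_c = frame.drop(columns="C")
print(list(frame.columns))       # original still has all three columns
print(list(without_c.columns))

# del removes the column from the original object in place
del frame["C"]

# pop() also removes in place, and returns the removed column as a Series
b_values = frame.pop("B")
print(list(frame.columns))
print(b_values.tolist())
```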


### **Indexing and Selection**

In [None]:
# Label-based indexing with .loc: select the row whose index label is 0
df.loc[0]

Unnamed: 0,0
Name,Alice
Age,27
City,New York
Occupation,Engineer
Salary,70000
Experience,2


In [None]:
# Label-based selection of multiple rows
df.loc[[1, 3]]

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
1,Bob,32,Los Angeles,Artist,85000,5
3,David,30,New York,Engineer,75000,3


In [None]:
# Using a label range (with .loc the end label 3 is included)
df.loc[:3]

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
0,Alice,27,New York,Engineer,70000,2
1,Bob,32,Los Angeles,Artist,85000,5
2,Charlie,37,Chicago,Doctor,120000,10
3,David,30,New York,Engineer,75000,3


In [None]:
# Display all rows for specific columns
df.loc[:,['Name', 'Salary']]

Unnamed: 0,Name,Salary
0,Alice,70000
1,Bob,85000
2,Charlie,120000
3,David,75000
4,Eve,90000
5,Frank,110000
6,Grace,60000


In [None]:
# Display specific rows and specific columns
df.loc[2:5, ['Name','Age', 'Salary']]

Unnamed: 0,Name,Age,Salary
2,Charlie,37,120000
3,David,30,75000
4,Eve,34,90000
5,Frank,42,110000


In [None]:
# Update values using labels
# Update the Salary of row 0 from 70000 to 72000
df.loc[0, 'Salary'] = 72000
df

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
0,Alice,27,New York,Engineer,72000,2
1,Bob,32,Los Angeles,Artist,85000,5
2,Charlie,37,Chicago,Doctor,120000,10
3,David,30,New York,Engineer,75000,3
4,Eve,34,Houston,Scientist,90000,7
5,Frank,42,Miami,Manager,110000,15
6,Grace,24,Boston,Designer,60000,1


In [None]:
# Integer-position-based indexing with .iloc
# Display the first row (position 0)
df.iloc[0]

Unnamed: 0,0
Name,Alice
Age,27
City,New York
Occupation,Engineer
Salary,72000
Experience,2


In [None]:
# Display rows at positions 3 and 4 (with .iloc the end position 5 is excluded)
df.iloc[3:5]

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
3,David,30,New York,Engineer,75000,3
4,Eve,34,Houston,Scientist,90000,7


In [None]:
# Integer-based displaying specific rows and specific columns
# Display 2nd row to 3rd row and columns 0 to 4
df.iloc[1:3, 0:5]

Unnamed: 0,Name,Age,City,Occupation,Salary
1,Bob,32,Los Angeles,Artist,85000
2,Charlie,37,Chicago,Doctor,120000


In [None]:
# Update a value using integer positions: row 0, column 4 (Salary)
df.iloc[0,4] = 73000
df

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
0,Alice,27,New York,Engineer,73000,2
1,Bob,32,Los Angeles,Artist,85000,5
2,Charlie,37,Chicago,Doctor,120000,10
3,David,30,New York,Engineer,75000,3
4,Eve,34,Houston,Scientist,90000,7
5,Frank,42,Miami,Manager,110000,15
6,Grace,24,Boston,Designer,60000,1
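
With the default `RangeIndex`, labels and positions coincide, so `.loc` and `.iloc` can look interchangeable. The sketch below uses a hypothetical frame (`staff`) with non-default labels to separate the two, and shows that label slices include the end while position slices do not.

```python
import pandas as pd

staff = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David"],
        "Salary": [70000, 85000, 120000, 75000],
    },
    index=[10, 20, 30, 40],   # hypothetical labels that differ from positions
)

by_label = staff.loc[10:30]     # label slice: end label 30 is included
by_position = staff.iloc[0:2]   # position slice: end position 2 is excluded

print(by_label["Name"].tolist())
print(by_position["Name"].tolist())

# The same cell addressed both ways: label 20 sits at position 1
print(staff.loc[20, "Name"], staff.iloc[1, 0])
```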


### **Filtering Data**

In [None]:
# Filter rows where age is greater than or equal to 30
df[df['Age']>=30]

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
1,Bob,32,Los Angeles,Artist,85000,5
2,Charlie,37,Chicago,Doctor,120000,10
3,David,30,New York,Engineer,75000,3
4,Eve,34,Houston,Scientist,90000,7
5,Frank,42,Miami,Manager,110000,15


In [None]:
# Filter rows where Occupation is 'Engineer' and city 'New York'
df[(df['Occupation']=='Engineer') & (df['City']=='New York')]

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
0,Alice,27,New York,Engineer,73000,2
3,David,30,New York,Engineer,75000,3


In [None]:
# Filter rows where Name is exactly 'Alice' or 'Bob' (isin() tests exact membership, not substrings)
df[df['Name'].isin(['Alice', 'Bob'])]

Unnamed: 0,Name,Age,City,Occupation,Salary,Experience
0,Alice,27,New York,Engineer,73000,2
1,Bob,32,Los Angeles,Artist,85000,5
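
A few related filters are worth contrasting with the ones above: substring matching, OR conditions, and negation. This is a sketch on a hypothetical frame (`names`, including the made-up entry "Bobby") chosen so the differences are visible.

```python
import pandas as pd

names = pd.DataFrame({
    "Name": ["Alice", "Bob", "Bobby", "Grace"],
    "Age": [27, 32, 41, 24],
})

# isin(): exact membership in a list of values
exact = names[names["Name"].isin(["Alice", "Bob"])]

# str.contains(): substring match, so "Bobby" is also selected
partial = names[names["Name"].str.contains("Bob")]

# | combines conditions with OR; each condition needs its own parentheses
either = names[(names["Age"] < 25) | (names["Age"] > 40)]

# ~ negates a boolean mask
not_bob = names[~names["Name"].isin(["Bob"])]

print(exact["Name"].tolist())
print(partial["Name"].tolist())
print(either["Name"].tolist())
print(not_bob["Name"].tolist())
```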


### **Handling Missing Values**

In [None]:
# Load the CSV file SalesCSV.csv into the DataFrame df_sales
df_sales = pd.read_csv('SalesCSV.csv')
df_sales

Unnamed: 0,OrderID,CustomerName,ProductCategory,Sales,Quantity,Region,OrderDate
0,1001,Alice Smith,Electronics,1200.5,2.0,North,1/15/2023
1,1002,Bob Johnson,Apparel,45.75,1.0,South,1/16/2023
2,1003,Charlie Brown,Home Goods,250.0,3.0,East,1/17/2023
3,1001,Alice Smith,Electronics,1200.5,2.0,North,1/15/2023
4,1005,Diana Prince,,150.0,,West,1/18/2023
5,1006,Eve Adams,Apparel,,1.0,North,1/19/2023
6,1007,Frank White,Electronics,800.2,,South,1/20/2023
7,1008,Grace Lee,Home Goods,75.5,5.0,East,1/21/2023
8,1009,Henry King,Electronics,,2.0,West,1/22/2023
9,1010,Ivy Green,,300.0,1.0,North,1/23/2023


In [None]:
# Check missing values
df_sales.isnull().sum()

Unnamed: 0,0
OrderID,0
CustomerName,0
ProductCategory,3
Sales,3
Quantity,4
Region,0
OrderDate,0


In [None]:
# Drop duplicate rows in place (row 3 repeats row 0)
df_sales.drop_duplicates(inplace=True)
df_sales

Unnamed: 0,OrderID,CustomerName,ProductCategory,Sales,Quantity,Region,OrderDate
0,1001,Alice Smith,Electronics,1200.5,2.0,North,1/15/2023
1,1002,Bob Johnson,Apparel,45.75,1.0,South,1/16/2023
2,1003,Charlie Brown,Home Goods,250.0,3.0,East,1/17/2023
4,1005,Diana Prince,,150.0,,West,1/18/2023
5,1006,Eve Adams,Apparel,,1.0,North,1/19/2023
6,1007,Frank White,Electronics,800.2,,South,1/20/2023
7,1008,Grace Lee,Home Goods,75.5,5.0,East,1/21/2023
8,1009,Henry King,Electronics,,2.0,West,1/22/2023
9,1010,Ivy Green,,300.0,1.0,North,1/23/2023
10,1011,Jack Black,Apparel,90.25,,South,1/24/2023
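
Before dropping duplicates it can help to see which rows would be removed. A minimal sketch using `duplicated()`, on a hypothetical three-row sample (`orders`) that repeats one order the way rows 0 and 3 do above:

```python
import pandas as pd

orders = pd.DataFrame({
    "OrderID": [1001, 1002, 1001],
    "CustomerName": ["Alice Smith", "Bob Johnson", "Alice Smith"],
    "Sales": [1200.5, 45.75, 1200.5],
})

# duplicated() marks every repeat of an earlier row as True
dup_mask = orders.duplicated()
print(dup_mask.tolist())

# Without inplace=True, drop_duplicates() returns a new frame and keeps the first occurrence
deduped = orders.drop_duplicates()
print(deduped.index.tolist())
```

Note that the surviving rows keep their original index labels, which is why the output above skips label 3.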


In [None]:
# Drop rows with missing values
df_sales = df_sales.dropna()
df_sales

Unnamed: 0,OrderID,CustomerName,ProductCategory,Sales,Quantity,Region,OrderDate
0,1001,Alice Smith,Electronics,1200.5,2.0,North,1/15/2023
1,1002,Bob Johnson,Apparel,45.75,1.0,South,1/16/2023
2,1003,Charlie Brown,Home Goods,250.0,3.0,East,1/17/2023
7,1008,Grace Lee,Home Goods,75.5,5.0,East,1/21/2023
11,1012,Karen Blue,Home Goods,180.0,4.0,East,1/25/2023
12,1013,Liam Scott,Electronics,600.0,2.0,West,1/26/2023
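
Dropping every row with a missing value can discard a lot of data. Two common alternatives are filling the gaps with `fillna()` or dropping only when a specific column is missing. The sketch below assumes a small hypothetical sample (`orders`), and the fill values ("Unknown", the column mean, 0) are illustrative choices, not recommendations for this dataset.

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    "ProductCategory": ["Electronics", None, "Apparel"],
    "Sales": [1200.5, 150.0, np.nan],
    "Quantity": [2.0, np.nan, 1.0],
})

# Fill each column with its own replacement value
filled = orders.fillna({
    "ProductCategory": "Unknown",
    "Sales": orders["Sales"].mean(),   # mean() skips missing values by default
    "Quantity": 0,
})
print(filled)

# Drop rows only when Sales is missing; other gaps are kept
sales_known = orders.dropna(subset=["Sales"])
print(sales_known)
```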
