#### Pandas: DataFrame and Series

Pandas is a popular Python library for data manipulation and analysis. Its labeled data structures and built-in functions make loading, cleaning, and transforming tabular data fast and flexible.

**Key Data Structures in Pandas:** 

1. **Series**:
   - A **Series** is a one-dimensional labeled array that can hold data of any type (integers, strings, floats, etc.).
   - It is similar to a column in a spreadsheet or a single column in a DataFrame.
   - Each element in a Series has an associated label, called an **index**.


In [3]:
import pandas as pd
data = ["Rajat","Simba","Rajat","Simba","Rajat","Simba","Rajat","Simba","Rajat","Simba"]
pdSeries = pd.Series(data)
print(pdSeries) # pandas assigns a default integer index (0 to n-1)
type(pdSeries) # pandas.core.series.Series

0    Rajat
1    Simba
2    Rajat
3    Simba
4    Rajat
5    Simba
6    Rajat
7    Simba
8    Rajat
9    Simba
dtype: object


pandas.core.series.Series

In [8]:
# Create a Series from a dictionary
data = {"Name":"Rajat","Age":18,"City":"Greater Noida"}
pdSeries = pd.Series(data)
print(pdSeries) # The dictionary keys become the index

print("-----------------")

# Create a Series from a list with a custom index
data = ["Rajat","Simba","Rajat","Simba"]
index = ["a","b","c","d"] # The index must have the same length as the data
pdSeries = pd.Series(data,index=index)
print(pdSeries) # The custom labels are used as the index


Name            Rajat
Age                18
City    Greater Noida
dtype: object
-----------------
a    Rajat
b    Simba
c    Rajat
d    Simba
dtype: object
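Because every element carries an index label, a Series can be read either by label or by position. The following is a minimal sketch of both styles; the variable names are illustrative, and the values simply reuse the custom-index example above.

```python
import pandas as pd

# Illustrative data: same values and labels as the custom-index example.
names = pd.Series(["Rajat", "Simba", "Rajat", "Simba"], index=["a", "b", "c", "d"])

by_label = names.loc["b"]          # look up by index label
by_position = names.iloc[1]        # look up by 0-based position
label_slice = names.loc["a":"c"]   # label slices include the end label
pos_slice = names.iloc[0:2]        # positional slices exclude the end position

print(by_label, by_position)
print(list(label_slice.index), list(pos_slice.index))
```

Here `.loc` and `.iloc` reach the same element in two different ways, which matters once the index is no longer the default 0 to n-1 range.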


2. **DataFrame**:
   - A **DataFrame** is a two-dimensional, size-mutable, and heterogeneous data structure with labeled axes (rows and columns).
   - It is similar to a spreadsheet or a SQL table, or a dictionary of Series objects.
   - A DataFrame can store different types of data (such as integers, floats, and strings) in different columns.
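To make the "dictionary of Series" view concrete, here is a small sketch that builds a DataFrame from separate Series and inspects the per-column types. The column names and values (including `Weight`) are made up for illustration.

```python
import pandas as pd

# Hypothetical columns, each created as its own Series with its own dtype.
names = pd.Series(["Rajat", "Simba"])
ages = pd.Series([18, 3])
weights = pd.Series([62.5, 4.2])

# Each dictionary key becomes a column name.
df = pd.DataFrame({"Name": names, "Age": ages, "Weight": weights})

print(df.dtypes)         # one dtype per column
print(type(df["Age"]))   # selecting a single column gives back a Series
```

Each column keeps its own type, while all columns share the same row index.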




In [4]:
import pandas as pd

# Create a DataFrame from a dictionary (keys become column names)
data = {
    "Name":["Rajat","Simba","Rajat","Simba"],
    "Age":[18,3,18,3],
    "City":["Greater Noida","Noida","Greater Noida","Noida"]
}
print(type(data)) # dict
pdDataFrame = pd.DataFrame(data)
print(pdDataFrame) # pandas assigns a default integer index (0 to n-1)
print(type(pdDataFrame)) # pandas.core.frame.DataFrame

<class 'dict'>
    Name  Age           City
0  Rajat   18  Greater Noida
1  Simba    3          Noida
2  Rajat   18  Greater Noida
3  Simba    3          Noida
<class 'pandas.core.frame.DataFrame'>


In [7]:
# Create a DataFrame from a list of dictionaries (one dictionary per row)
data = [
    {"Name":["Rajat","Raj"],"Age":18,"City":"Greater Noida"}, # A list value is stored as one object in a single cell
    {"Name":"Simba","Age":3,"City":"Noida"},
    {"Name":"Rajat","Age":18,"City":"Greater Noida"},
    {"Name":"Simba","Age":3,"City":"Noida"}
]
pdDataFrame = pd.DataFrame(data)
print(pdDataFrame)

           Name  Age           City
0  [Rajat, Raj]   18  Greater Noida
1         Simba    3          Noida
2         Rajat   18  Greater Noida
3         Simba    3          Noida


In [15]:
# Reading a CSV file
data = pd.read_csv("x1.csv")
data.head(3) # First 3 rows
# print(data) # To print all the rows



|   | Transaction ID | Date | Product Category | Product Name | Units Sold | Unit Price | Total Revenue | Region | Payment Method |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10001 | 2024-01-01 | Electronics | iPhone 14 Pro | 2 | 999.99 | 1999.98 | North America | Credit Card |
| 1 | 10002 | 2024-01-02 | Home Appliances | Dyson V11 Vacuum | 1 | 499.99 | 499.99 | Europe | PayPal |
| 2 | 10003 | 2024-01-03 | Clothing | Levi's 501 Jeans | 3 | 69.99 | 209.97 | Asia | Debit Card |


In [53]:
# Accessing the data loaded above

# Columns
# print(data["Product Name"]) # Accessing a single column (returns a Series)
# print(data[["Product Name","Unit Price"]]) # Accessing multiple columns (returns a DataFrame)

# Rows - iloc and loc

# .iloc: Selects data by integer position (like array indexing); slice ends are excluded.
# data.iloc[0] # Accessing a single row
# data.iloc[0:3] # Accessing the first three rows (positions 0, 1, 2)
# data.iloc[0,3] # Accessing a single cell: row position 0, column position 3
# data.iloc[0:3,0:3] # Accessing a block: first three rows and first three columns

# .loc: Selects data by label (index label / column name); slice ends are included.
# data.loc[0,"Product Name"] # Accessing a single cell: row label 0, column "Product Name"
# data.loc[0:3,"Product Name"] # Accessing rows labeled 0 through 3 (four rows) of a single column
# data.loc[0:3,["Product Name","Unit Price"]] # Accessing rows labeled 0 through 3 of two columns
# data.loc[data["Unit Price"]>1000] # Accessing rows that satisfy a condition
# data.loc[data['Transaction ID'] == 10008, 'Product Name'] # Product name(s) for transaction 10008

# Accessing a single cell
data.at[0,"Product Name"] # Row label, column label
data.iat[0,3] # Row position, column position

'iPhone 14 Pro'
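The difference between `.iloc` and `.loc` is easiest to see on a small frame. The sketch below uses a made-up five-row table (not the contents of `x1.csv`) to show that positional slices exclude the end, label slices include it, and that labels and positions stop matching once a row is dropped.

```python
import pandas as pd

# Hypothetical miniature of the sales table; values are illustrative only.
df = pd.DataFrame({
    "Product Name": ["A", "B", "C", "D", "E"],
    "Unit Price": [10.0, 20.0, 30.0, 40.0, 50.0],
})

rows_iloc = df.iloc[0:3]   # positions 0, 1, 2 (end excluded)
rows_loc = df.loc[0:3]     # labels 0, 1, 2, 3 (end included)
print(len(rows_iloc), len(rows_loc))

# After dropping a row, labels and positions no longer line up.
trimmed = df.drop(0)
first_by_position = trimmed.iloc[0]["Product Name"]  # first remaining row
first_by_label = trimmed.loc[1, "Product Name"]      # same row, via its label
print(first_by_position, first_by_label)

# Boolean mask: keep only the rows where the condition holds.
expensive = df.loc[df["Unit Price"] > 25, "Product Name"]
print(expensive.tolist())
```

A reasonable rule of thumb is to use `.loc` when you know the label and `.iloc` when you know the position; `.at` and `.iat` are the single-cell counterparts.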

In [4]:
import pandas as pd
# Data Manipulation with Pandas - Adding, Removing, Changing Columns and Rows

# Reading a CSV file
data = pd.read_csv("x1.csv")

# Adding a new column (assigning to an existing column name would overwrite it instead)
data["x"] = "Rajat" # Every row gets the value "Rajat"
data.head(3)

# Remove a column
data = data.drop("x", axis=1)
data.head(3)

# inplace = True will make the changes in the original dataframe itself 
# and will not return anything.
data.drop("Unit Price", axis=1, inplace=True)
data.head(3)

# Drop a row
data.drop(0, axis=0, inplace=True)
data.head(3)

|   | Transaction ID | Date | Product Category | Product Name | Units Sold | Total Revenue | Region | Payment Method |
|---|---|---|---|---|---|---|---|---|
| 1 | 10002 | 2024-01-02 | Home Appliances | Dyson V11 Vacuum | 1 | 499.99 | Europe | PayPal |
| 2 | 10003 | 2024-01-03 | Clothing | Levi's 501 Jeans | 3 | 209.97 | Asia | Debit Card |
| 3 | 10004 | 2024-01-04 | Books | The Da Vinci Code | 4 | 63.96 | North America | Credit Card |
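The two calling styles of `drop` behave differently, and the row labels are not renumbered after a row is removed (the output above starts at label 1). The sketch below illustrates this on a made-up three-row frame; `reset_index(drop=True)` is one optional way to renumber the rows, not something the cell above does.

```python
import pandas as pd

# Hypothetical three-row frame used only to illustrate drop().
df = pd.DataFrame({"Product": ["A", "B", "C"], "Units": [1, 2, 3]})

without_units = df.drop(columns="Units")   # returns a new DataFrame
print(list(df.columns))                    # the original still has both columns
print(list(without_units.columns))

result = df.drop(index=0, inplace=True)    # modifies df itself and returns None
print(result)
labels_after_drop = list(df.index)         # the remaining labels keep their old values
print(labels_after_drop)

df = df.reset_index(drop=True)             # renumber the rows starting from 0
print(list(df.index))
```

`columns=` and `index=` are equivalent to passing `axis=1` and `axis=0`, and some readers find them easier to follow.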


In [5]:
# More methods for inspecting the data
data.describe() # Summary statistics for the numeric columns
data.info() # Column names, non-null counts, dtypes, and memory usage
data.shape # (number of rows, number of columns)
data.columns # Column labels
data.index # Row labels
data.tail(5) # Last 5 rows
data.dtypes # Data type of each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 239 entries, 1 to 239
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Transaction ID    239 non-null    int64  
 1   Date              239 non-null    object 
 2   Product Category  239 non-null    object 
 3   Product Name      239 non-null    object 
 4   Units Sold        239 non-null    int64  
 5   Total Revenue     239 non-null    float64
 6   Region            239 non-null    object 
 7   Payment Method    239 non-null    object 
dtypes: float64(1), int64(2), object(5)
memory usage: 15.1+ KB


Transaction ID        int64
Date                 object
Product Category     object
Product Name         object
Units Sold            int64
Total Revenue       float64
Region               object
Payment Method       object
dtype: object
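In a notebook cell only the last expression is displayed, which is why the output above shows the printed `info()` report followed by `dtypes` alone. As a rough sketch of what the other inspection calls return, the example below prints them explicitly on a small made-up frame (the values are illustrative, not taken from `x1.csv`).

```python
import pandas as pd

# Hypothetical frame with one text column and two numeric columns.
df = pd.DataFrame({
    "Region": ["Asia", "Europe", "Asia"],
    "Units Sold": [3, 1, 2],
    "Total Revenue": [209.97, 499.99, 139.98],
})

numeric_summary = df.describe()             # numeric columns only, by default
full_summary = df.describe(include="all")   # also summarises the text column

print(df.shape)                       # (rows, columns)
print(list(numeric_summary.columns))
print(list(full_summary.columns))
```

Wrapping each call in `print` (or placing it last in its own cell) is what makes its result visible.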