## Pandas: Series & DataFrame Data Structures


This notebook covers:
- **Pandas Data Structures**: Series & DataFrame
- **Handling DataFrames**: Loading, Inspecting, and Modifying Data
- **Handling Missing Data**: Strategies to deal with NaN values
- **Grouping and Aggregation**: Summarizing data
- **Data Visualization**: Analyzing trends visually

## **Introduction**: Pandas

Pandas is a powerful open-source Python library for data manipulation and analysis.

It offers flexible data structures such as Series (one-dimensional) and DataFrame (two-dimensional).

### ✅ Why Use Pandas?
- Easy handling of missing data
- Powerful group-by functionality
- Fast and efficient merging, reshaping, slicing, and dicing of data
- Read/write support for many file formats: CSV, Excel, JSON, SQL, etc.


## **Understanding Pandas Series and DataFrame Data Structures**

### **Definition & Functionality**
- **Series**: A one-dimensional labeled array, similar to a column in a spreadsheet.

- **DataFrame**: A two-dimensional table-like structure with labeled rows and columns, similar to an Excel sheet.
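The relationship between the two structures can be sketched with a tiny made-up table (the names `table`, `item`, and `qty` are illustrative assumptions, not part of the notebook's data): selecting one column of a DataFrame yields a Series that shares the DataFrame's row labels.

```python
import pandas as pd

# Hypothetical data, for illustration only
table = pd.DataFrame({'item': ['pen', 'book'], 'qty': [3, 5]},
                     index=['r1', 'r2'])

column = table['qty']            # selecting a single column returns a Series

print(type(table).__name__)      # DataFrame
print(type(column).__name__)     # Series
print(column.index.equals(table.index))   # True: the row labels are shared
```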

In [2]:
import pandas as pd
import numpy as np

### Creating Series manually

In [3]:
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print("Series:\n", s)

Series:
 a    10
b    20
c    30
d    40
dtype: int64


In [23]:
prices = [649000, 391000, 5476000, 1786000, 1091000]
carnames = ['swift', 'santro', 'audi', 'elantra', 'bolero']

In [24]:
car_series = pd.Series(data = prices, index = carnames)
car_series

swift       649000
santro      391000
audi       5476000
elantra    1786000
bolero     1091000
dtype: int64

In [25]:
type(car_series)

pandas.core.series.Series

In [26]:
car_series = pd.Series(prices)
car_series

0     649000
1     391000
2    5476000
3    1786000
4    1091000
dtype: int64

In [27]:
car_series = pd.Series(prices, carnames, name = 'price')
car_series

swift       649000
santro      391000
audi       5476000
elantra    1786000
bolero     1091000
Name: price, dtype: int64

In [73]:
# The `name` attribute for the Pandas series is optional but helpful and
# makes the data organization neater and easier to handle.
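One place where the name pays off, shown here as a small sketch with a shortened price list: when a named Series is converted to a DataFrame with `to_frame()`, the name becomes the column label.

```python
import pandas as pd

# Shortened version of the car price series, for illustration
named = pd.Series([649000, 391000], index=['swift', 'santro'], name='price')

frame = named.to_frame()         # one-column DataFrame built from the Series
print(list(frame.columns))       # ['price']
```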

### Example
Creating series from dictionaries

In [32]:
entries = {'swift': 649000,
           'santro': '391000',
           'audi': 5476000,
           'elantra': 1786000,
           'bolero': 1091000}
car_series = pd.Series(data = entries, name = 'price')
car_series

swift       649000
santro      391000
audi       5476000
elantra    1786000
bolero     1091000
Name: price, dtype: object

In [33]:
type(car_series['swift'])

int

In [34]:
type(car_series['santro']) # as santro price is string data type

str
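One possible way to repair such a mixed-type series is `pd.to_numeric`. This is only a sketch, and it assumes every entry can be parsed as a number; the variable names `mixed` and `fixed` are illustrative.

```python
import pandas as pd

entries = {'swift': 649000, 'santro': '391000', 'audi': 5476000}
mixed = pd.Series(entries, name='price')   # object dtype, because of the string

# Parse every entry as a number; an unparseable value would raise an error here
fixed = pd.to_numeric(mixed)

print(mixed.dtype)               # object
print(fixed.dtype)               # a numeric (integer) dtype
print(fixed['santro'] + 1)       # arithmetic now works on the former string
```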

In [35]:
len(car_series)

5

In [36]:
car_series.shape

(5,)

### Accessing data from series

In this section, we will study various methods to access data from series.

### Example
Accessing data from series using logical conditions

In [37]:
entries = {'swift': 649000,
           'santro': 391000,
           'audi': 5476000,
           'elantra': 1786000,
           'bolero': 1091000}
car_series = pd.Series(data = entries, name = 'price')
car_series

swift       649000
santro      391000
audi       5476000
elantra    1786000
bolero     1091000
Name: price, dtype: int64

In [41]:
### Display all car models priced above 10,00,000 rupees (10 lakh)

In [42]:
car_series > 1000000

swift      False
santro     False
audi        True
elantra     True
bolero      True
Name: price, dtype: bool

In [43]:
car_series[car_series>1000000]

audi       5476000
elantra    1786000
bolero     1091000
Name: price, dtype: int64

In [45]:
car_series[car_series > 1000000].index[0]

'audi'

In [46]:
car_series[car_series > 1000000].index[2]

'bolero'

In [47]:
car_series[car_series > 1000000].values[0]

5476000

In [48]:
car_series[car_series > 1000000].values[2]

1091000

In [49]:
car_series[(car_series > 1000000) & (car_series < 2000000)]

elantra    1786000
bolero     1091000
Name: price, dtype: int64

### Quiz
Consider the series shown below:
```
cust_names = ['Mahesh', 'Farheen', 'Himadri', 'Monisha']
cust_bill = [256.78, 434.53, 109.25, 529.42]
cust_info = pd.Series(cust_bill, cust_names)
```
Write code to print the names of the customers who have spent more than 300 rupees.

In [50]:
cust_names = ['Mahesh', 'Farheen', 'Himadri', 'Monisha']
cust_bill = [256.78, 434.53, 109.25, 529.42]
cust_info = pd.Series(cust_bill, cust_names)
print(list(cust_info[cust_info > 300].index))

['Farheen', 'Monisha']


### Example
Accessing data from series using the `.loc[]` method

In [52]:
entries = {'swift': 649000,
           'santro': 391000,
           'audi': 5476000,
           'elantra': 1786000,
           'bolero': 1091000}
car_series = pd.Series(data = entries, name = 'price')
car_series

swift       649000
santro      391000
audi       5476000
elantra    1786000
bolero     1091000
Name: price, dtype: int64

In [53]:
car_series.loc['swift']

649000

In [54]:
car_series.loc[['swift']]

swift    649000
Name: price, dtype: int64

In [55]:
car_series.loc[['swift', 'audi']]

swift     649000
audi     5476000
Name: price, dtype: int64

In [56]:
car_series.loc['swift':'elantra']

swift       649000
santro      391000
audi       5476000
elantra    1786000
Name: price, dtype: int64

Note that this resembles NumPy array slicing, except that `.loc[]` slices by label and includes the stop label.

In [57]:
car_series.loc['santro':'audi']

santro     391000
audi      5476000
Name: price, dtype: int64

In [58]:
car_series.loc[:'elantra']


swift       649000
santro      391000
audi       5476000
elantra    1786000
Name: price, dtype: int64

### Example
Accessing data from series using the `.iloc[]` method

In [59]:
entries = {'swift': 649000,
           'santro': 391000,
           'audi': 5476000,
           'elantra': 1786000,
           'bolero': 1091000}
car_series = pd.Series(data = entries, name = 'price')
car_series

swift       649000
santro      391000
audi       5476000
elantra    1786000
bolero     1091000
Name: price, dtype: int64

In [60]:
car_series.iloc[0]

649000

In [61]:
car_series.iloc[3]

1786000

In [62]:
car_series.iloc[[3]]

elantra    1786000
Name: price, dtype: int64

In [63]:
car_series.iloc[[0, 2, 4]]

swift      649000
audi      5476000
bolero    1091000
Name: price, dtype: int64

In [64]:
car_series.iloc[0:2]

swift     649000
santro    391000
Name: price, dtype: int64

Note that, unlike `.loc[]`, the `.iloc[]` method excludes the stop position. It behaves much like NumPy array indexing and slicing.
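A side-by-side sketch of the two slicing rules, using a small throwaway series (`s` and its labels are illustrative):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = s.loc['a':'c']        # label slice: the stop label 'c' is included
by_position = s.iloc[0:2]        # position slice: the stop position 2 is excluded

print(list(by_label.index))      # ['a', 'b', 'c']
print(list(by_position.index))   # ['a', 'b']
```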

In [65]:
car_series.iloc[-1]

1091000

In [66]:
car_series.iloc[[-1]]

bolero    1091000
Name: price, dtype: int64

In [67]:
car_series.iloc[1:5:2]

santro      391000
elantra    1786000
Name: price, dtype: int64

In [68]:
car_series.iloc[::-1]

bolero     1091000
elantra    1786000
audi       5476000
santro      391000
swift       649000
Name: price, dtype: int64

### Quiz
Consider the series shown below:
```
cust_names = ['Mahesh', 'Farheen', 'Himadri', 'Monisha']
cust_bill = [256.78, 434.53, 109.25, 529.42]
cust_info = pd.Series(cust_bill, cust_names)
```
Use the different methods you have studied to extract the bill amounts for Mahesh and Monisha.

In [70]:
cust_names = ['Mahesh', 'Farheen', 'Himadri', 'Monisha']
cust_bill = [256.78, 434.53, 109.25, 529.42]
cust_info = pd.Series(cust_bill, cust_names)

In [72]:
cust_info[['Mahesh', 'Monisha']]

Mahesh     256.78
Monisha    529.42
dtype: float64

## **Handling DataFrames**

### **Definition & Functionality**
- **Loading Data**: Import CSV, Excel, or SQL files into Pandas DataFrames.
- **Inspecting Data**: View structure, columns, and basic statistics.
- **Modifying Data**: Add, rename, or remove columns.

### Creating DataFrame manually

In [5]:

df = pd.DataFrame({
    'Product': ['A', 'B', 'C'],
    'Price': [100, 150, 200],
    'Quantity': [5, 3, 4]
})
print("\nDataFrame:\n", df)


DataFrame:
   Product  Price  Quantity
0       A    100         5
1       B    150         3
2       C    200         4
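The "Modifying Data" operations listed at the start of this section (add, rename, remove columns) can be sketched on the same small table. The names `df_demo`, `Total`, and `Units` are assumptions made for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'Product': ['A', 'B', 'C'],
                        'Price': [100, 150, 200],
                        'Quantity': [5, 3, 4]})

# Add a column computed from existing columns
df_demo['Total'] = df_demo['Price'] * df_demo['Quantity']

# Rename a column (returns a new DataFrame; df_demo itself is unchanged)
renamed = df_demo.rename(columns={'Quantity': 'Units'})

# Remove a column (also returns a new DataFrame)
trimmed = renamed.drop(columns=['Price'])

print(list(trimmed.columns))     # ['Product', 'Units', 'Total']
```

Both `rename` and `drop` leave the original DataFrame untouched unless the result is assigned back to the same variable.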


### How to create a CSV file from existing DataFrame data



In [2]:
import pandas as pd

In [3]:
df_csv = pd.DataFrame({
    'Product': ['Soap', 'Shampoo', 'Toothpaste', 'Oil'],
    'Price': [30, 120, 45, 150],
    'Units_Sold': [100, 60, 80, 50]
})
df_csv

      Product  Price  Units_Sold
0        Soap     30         100
1     Shampoo    120          60
2  Toothpaste     45          80
3         Oil    150          50


In [4]:
# Save to CSV for later use (in a real scenario you would read from an existing file)
df_csv.to_csv("retail_sales.csv", index=False)

print('file created successfully')

file created successfully


### Read CSV using pandas


In [5]:
sales_data = pd.read_csv("retail_sales.csv")

print("\n CSV Data from 'retail_sales.csv':\n\n", sales_data)



 CSV Data from 'retail_sales.csv':

       Product  Price  Units_Sold
0        Soap     30         100
1     Shampoo    120          60
2  Toothpaste     45          80
3         Oil    150          50


In [6]:
sales_data

      Product  Price  Units_Sold
0        Soap     30         100
1     Shampoo    120          60
2  Toothpaste     45          80
3         Oil    150          50


In [7]:
print("\nDisplay Price column:")

sales_data['Price']



Display Price column:


0     30
1    120
2     45
3    150
Name: Price, dtype: int64

In [None]:
### Quiz Time

In [14]:
### Quiz: display the values of the Units_Sold column

In [8]:
# Quiz: display the Product and Price columns together

In [10]:
sales_data

      Product  Price  Units_Sold
0        Soap     30         100
1     Shampoo    120          60
2  Toothpaste     45          80
3         Oil    150          50


In [13]:
sales_data[['Price','Units_Sold']]

   Price  Units_Sold
0     30         100
1    120          60
2     45          80
3    150          50
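The bracket style decides what comes back. This sketch uses a shortened copy of the sales table and illustrative names `sales`, `as_series`, and `as_frame`: a single column name returns a Series, while a list of names returns a DataFrame, even when the list has only one entry.

```python
import pandas as pd

sales = pd.DataFrame({'Product': ['Soap', 'Shampoo'],
                      'Price': [30, 120],
                      'Units_Sold': [100, 60]})

as_series = sales['Price']       # single name   -> Series
as_frame = sales[['Price']]      # list of names -> DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (2, 1)
```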


### How to read csv file 

In [14]:
import pandas as pd

# Load the dataset
file_path = "C:\\Users\\askpr\\Tutedude\\pharma_sales_data.csv"  # Note: use \\ as the path separator in a regular string
df = pd.read_csv(file_path)


In [15]:
df.head()

   Product_ID  Product_Name      Category  Sales_Units  Revenue  Region
0         101    PainRelief     Analgesic        500.0  25000.0   North
1         102    CoughSyrup  Cough & Cold        300.0  12000.0   South
2         103   AntibioticX    Antibiotic          NaN  40000.0    East
3         104      VitaminD    Supplement        700.0      NaN    West
4         105   DiabetesMed      Diabetes        450.0  31500.0   North


In [None]:
### Option 2 to read csv file

In [19]:
import pandas as pd

# Load the dataset
# file_path = "C:/Users/askpr/Tutedude/pharma_sales_data.csv"  # Note: / also works as the path separator
file_path = r"C:\Users\askpr\Tutedude\pharma_sales_data.csv"
df = pd.read_csv(file_path)
df.head()

   Product_ID  Product_Name      Category  Sales_Units  Revenue  Region
0         101    PainRelief     Analgesic        500.0  25000.0   North
1         102    CoughSyrup  Cough & Cold        300.0  12000.0   South
2         103   AntibioticX    Antibiotic          NaN  40000.0    East
3         104      VitaminD    Supplement        700.0      NaN    West
4         105   DiabetesMed      Diabetes        450.0  31500.0   North


In [None]:
# Option 3: raw string
# The r prefix stands for "raw" and ensures backslashes are treated literally.

# Without r: "C:\\Users\\askpr\\Tutedude\\pharma_sales_data.csv" (each backslash must be escaped as \\).

# With r: r"C:\Users\askpr\Tutedude\pharma_sales_data.csv" (no escaping needed).

In [None]:
# Option 4 (best option): keep the Python file (or notebook) and the CSV file in the same folder.
# A bare file name is a relative path, resolved against the current working directory,
# which is normally the folder the notebook or script is run from.
# In that case no folder path is needed: the file name alone is enough.
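Another way to sidestep separator problems is the standard library's `pathlib`. This is a sketch rather than part of the original notebook: the folder and the file name `example_sales.csv` are hypothetical, and the file is created in a temporary directory only so that the snippet runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical folder and file, created here only so the sketch is runnable
folder = Path(tempfile.mkdtemp())
csv_path = folder / "example_sales.csv"    # "/" joins path parts on any OS

# Write a one-row CSV so there is something to read back
pd.DataFrame({'Product': ['Soap'], 'Price': [30]}).to_csv(csv_path, index=False)

loaded = pd.read_csv(csv_path)             # read_csv accepts Path objects
print(loaded)
```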

In [20]:
file_path = "pharma_sales_data.csv"
df = pd.read_csv(file_path)
df.head()

   Product_ID  Product_Name      Category  Sales_Units  Revenue  Region
0         101    PainRelief     Analgesic        500.0  25000.0   North
1         102    CoughSyrup  Cough & Cold        300.0  12000.0   South
2         103   AntibioticX    Antibiotic          NaN  40000.0    East
3         104      VitaminD    Supplement        700.0      NaN    West
4         105   DiabetesMed      Diabetes        450.0  31500.0   North


### To display top 7 records

In [20]:
df.head(7)

   Product_ID  Product_Name      Category  Sales_Units  Revenue  Region
0         101    PainRelief     Analgesic        500.0  25000.0   North
1         102    CoughSyrup  Cough & Cold        300.0  12000.0   South
2         103   AntibioticX    Antibiotic          NaN  40000.0    East
3         104      VitaminD    Supplement        700.0      NaN    West
4         105   DiabetesMed      Diabetes        450.0  31500.0   North
5         106       Antacid     Digestive          NaN  18000.0   South
6         107   AllergyPill       Allergy        600.0      NaN    East


###  To display bottom (last) 5 records

In [21]:
df.tail()

   Product_ID  Product_Name    Category  Sales_Units  Revenue  Region
3         104      VitaminD  Supplement        700.0      NaN    West
4         105   DiabetesMed    Diabetes        450.0  31500.0   North
5         106       Antacid   Digestive          NaN  18000.0   South
6         107   AllergyPill     Allergy        600.0      NaN    East
7         108       FluShot     Vaccine        800.0  56000.0    West


### **Pharmaceutical Case Study: Understanding Drug Sales Data**
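A pharmaceutical company collects sales data to analyze drug demand across regions. The dataset loaded above contains:
- `Product_ID`: Numeric identifier of the product.
- `Product_Name`: Name of the pharmaceutical product.
- `Category`: Category of the product (for example Analgesic or Antibiotic).
- `Sales_Units`: Number of units sold.
- `Revenue`: Total revenue from sales.
- `Region`: Sales region (North, South, East, West).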



###  Inspecting Pharmaceutical Sales Data
A pharmaceutical company wants to inspect its dataset before analysis.


### Display basic information

In [2]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Product_ID    8 non-null      int64  
 1   Product_Name  8 non-null      object 
 2   Category      8 non-null      object 
 3   Sales_Units   6 non-null      float64
 4   Revenue       6 non-null      float64
 5   Region        8 non-null      object 
dtypes: float64(2), int64(1), object(3)
memory usage: 512.0+ bytes
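The non-null counts above show that `Sales_Units` and `Revenue` each have missing entries. A common way to count them directly is `isna().sum()`. The sketch below uses a small stand-in table called `sample` (an illustrative name) whose values mirror the first four rows shown by `df.head()`.

```python
import numpy as np
import pandas as pd

# Small stand-in for the sales table; values mirror the first rows of df.head()
sample = pd.DataFrame({
    'Product_Name': ['PainRelief', 'CoughSyrup', 'AntibioticX', 'VitaminD'],
    'Sales_Units': [500.0, 300.0, np.nan, 700.0],
    'Revenue': [25000.0, 12000.0, 40000.0, np.nan],
})

missing_per_column = sample.isna().sum()   # number of NaN values in each column
print(missing_per_column)
```

On the full dataset, the same call on `df` should agree with the counts from `info()`: the number of rows minus the non-null count for each column.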


### Display summary statistics

In [3]:

df.describe()

       Product_ID  Sales_Units       Revenue
count         8.0          6.0           6.0
mean        104.5   558.333333  30416.666667
std       2.44949    180.04629  15938.684596
min         101.0        300.0       12000.0
25%        102.75        462.5       19750.0
50%         104.5        550.0       28250.0
75%        106.25        675.0       37875.0
max         108.0        800.0       56000.0


## **Conclusion**
- Pandas simplifies pharmaceutical data analysis.
- DataFrames efficiently store and manipulate structured data.
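- CSV files can be read with `pd.read_csv()` and their contents displayed with `head()` and `tail()`.
- Basic functions such as `info()` and `describe()` give a quick overview of a dataset's structure and summary statistics.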
