### Pandas - Data Analysis & Manipulation tool

*Reference: Python Pandas 7.pdf (P1)*
#### Pandas is built on top of NumPy
- Pandas uses NumPy arrays internally to store data and perform fast calculations.
#### It adds high-level data structures and tools
- Pandas provides easy-to-use, ready-made structures such as:
    - Series (a single column)
    - DataFrame (a table, like an Excel sheet)
- It also provides tools for sorting, filtering, grouping, and cleaning data.
#### These make it easier to work with tabular, labeled, or heterogeneous datasets
1. `Tabular Data`: Data arranged in rows and columns.
2. `Labeled Data`: Data with row labels (index) and column names.
3. `Heterogeneous Data`: Data with different data types in one table.

In [2]:
# Usage of pandas
import pandas as pd
df = pd.DataFrame({
    "Name":["Rohit","Namo"],
    "CGPA":[9.5,9.7]
})
print(df)

    Name  CGPA
0  Rohit   9.5
1   Namo   9.7
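
To make the NumPy connection and the idea of heterogeneous data concrete, here is a small sketch that reuses the same two-row table. The variable name `cgpa_values` is introduced only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Rohit", "Namo"],
    "CGPA": [9.5, 9.7],
})

# Heterogeneous data: each column carries its own dtype
print(df.dtypes)

# A numeric column can be handed back as a plain NumPy array
cgpa_values = df["CGPA"].to_numpy()
print(type(cgpa_values))
print(cgpa_values.mean())
```

The text column and the float column live side by side in one table, while the numeric column is still available as an ordinary NumPy array for fast calculations.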


In [None]:
# 1. Series:
# A Series is a one-dimensional data structure in Pandas that stores a sequence of values along with labels (the index).
# It can hold data of any type: integers, floats, strings, Python objects.
# Values in a Series can be accessed by position or by label.

s = pd.Series(["Rohit","Namo","Manya"])
# s = pd.Series([99,95,92])
# s = pd.Series([99.2,95.5,92.1])
print(s)

# tells the data structure (here Series)
print(type(s))

# tells the data type of the values stored in it.
print(s.dtype)

# Accessing values using Indexing
print(s[1])
print(s[0])


0    Rohit
1     Namo
2    Manya
dtype: object
<class 'pandas.core.series.Series'>
object
Namo
Rohit
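
As a hedged follow-up to the cell above, the sketch below shows the default index that Pandas assigns when none is given, plus a position-based slice with `iloc`. The name `first_two` is only illustrative.

```python
import pandas as pd

s = pd.Series(["Rohit", "Namo", "Manya"])

# With no index given, Pandas assigns a RangeIndex starting at 0
print(s.index)

# Slice by position: positions 0 and 1 (the stop position is excluded)
first_two = s.iloc[0:2]
print(first_two.tolist())

# Number of elements in the Series
print(s.size)
```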


In [7]:
# Series with custom indexing
s2 = pd.Series([21,20,25,26],index = ["Rohit","Namo","Charlie","Bob"])
print(s2)

# Accessing values using labels (custom indexing)
print(s2["Rohit"])
print(s2["Namo"])

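# Accessing values by position: use iloc, because s2[0] on a Series with
# string labels is deprecated and raises a FutureWarning
print(s2.iloc[0])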

print(s2.index)

Rohit      21
Namo       20
Charlie    25
Bob        26
dtype: int64
21
20
21
Index(['Rohit', 'Namo', 'Charlie', 'Bob'], dtype='object')
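
The difference between label-based and position-based access can be sketched with the two explicit indexers, `loc` and `iloc`. This example builds the same data from a dictionary, which is an assumption made for brevity; the variable names are illustrative.

```python
import pandas as pd

# Dictionary keys become the index labels
ages = pd.Series({"Rohit": 21, "Namo": 20, "Charlie": 25, "Bob": 26})

by_label = ages.loc["Charlie"]      # look up by label
by_position = ages.iloc[2]          # look up by integer position
subset = ages.loc[["Rohit", "Bob"]] # several labels at once

# Label slices include BOTH endpoints, unlike position slices
label_slice = ages.loc["Namo":"Charlie"]

print(by_label, by_position)
print(subset)
print(label_slice)
```

Using `loc` and `iloc` explicitly avoids any ambiguity about whether a value such as `0` is meant as a label or as a position.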




In [None]:
# Characteristics of a Series:
# 1. They are homogeneous: a Series has a single dtype. Mixed values are stored under the generic "object" dtype.

# 2. They support Vectorized operations. 
sv = pd.Series([1,2,3])
sv2 = pd.Series([4,5,6])
print(sv + sv2)

# 3. They can handle missing values with NaN (covered later).

# 4. Their values are mutable: existing data can be modified in place.
# Methods that remove entries, such as drop(), do not change the original;
# they return a new Series instead.

s = pd.Series([1,2,3,4,5])
s[0] = 100
print(s)

change_s = s.drop(0)

print(s)
print(change_s)


0    5
1    7
2    9
dtype: int64
0    100
1      2
2      3
3      4
4      5
dtype: int64
0    100
1      2
2      3
3      4
4      5
dtype: int64
1    2
2    3
3    4
4    5
dtype: int64
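
Points 2 and 3 above interact: vectorized operations align values by index label, and any label without a partner produces NaN. The following is a minimal sketch with made-up labels `x`, `y`, `z`.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["x", "y"])

# Addition is aligned by label; "z" exists only in a, so the result is NaN there
total = a + b
print(total)

# Count the missing values
print(total.isna().sum())

# Treat a missing partner as 0 instead of producing NaN
filled = a.add(b, fill_value=0)
print(filled)
```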


In [None]:
# 2. DataFrame:
# A DataFrame is a two-dimensional, tabular data structure.
# It contains rows, columns, row labels and column labels.
# Each column in a DataFrame is a "Series".
# Three common ways to create a DataFrame:
# 1. Using a dictionary
# 2. Using lists
# 3. Using a NumPy array

info = {
    "Name" : ["Adam", "Eve", "Bob"],  
    "Marks" : [78, 99, 85],  
    "Grade" : ['B', 'O', 'A']  
}

df = pd.DataFrame(info)

print(df)
print(type(df))

print(df.index)
print(df.columns)

# To see the DataFrame rendered as a formatted table, evaluate it on its own in the next cell:
# df

   Name  Marks Grade
0  Adam     78     B
1   Eve     99     O
2   Bob     85     A
<class 'pandas.core.frame.DataFrame'>
RangeIndex(start=0, stop=3, step=1)
Index(['Name', 'Marks', 'Grade'], dtype='object')
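
The statement that each column is a Series can be checked directly. This sketch reuses the `info` dictionary from the cell above; `marks` and `subset` are illustrative names.

```python
import pandas as pd

info = {
    "Name": ["Adam", "Eve", "Bob"],
    "Marks": [78, 99, 85],
    "Grade": ["B", "O", "A"],
}
df = pd.DataFrame(info)

# A single column is a Series, so Series methods work on it
marks = df["Marks"]
print(type(marks))
print(marks.mean())

# A list of column names returns a smaller DataFrame
subset = df[["Name", "Grade"]]
print(subset.shape)
```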


In [9]:
df

   Name  Marks Grade
0  Adam     78     B
1   Eve     99     O
2   Bob     85     A


In [45]:
# Accessing values from DataFrame:

# From Columns
print(df["Name"])

# Rows cannot be accessed the same way: df[label] looks up a column, not a row.
# To access rows (or rows and columns together) we use "indexers".

# 1. loc -
# stands for location; it selects by label (including custom index labels)
# df.loc[row_label, column_label]
info2 = {
    "Name" : ["Adam", "Eve", "Bob"],  
    "Marks" : [78, 99, 85],  
    "Grade" : ['B', 'O', 'A']  
}
df2 = pd.DataFrame(info2,index=["S101","S102","S103"])
print() #for space
# print(df2)

# Accessing a row by its custom index label
print(df2.loc["S101"])

# Accessing a column by its label (":" selects all rows)
print() #space
print(df2.loc[:,"Name"])

# 2. iloc - Integer Location
# it is used for position-based indexing
# df.iloc[row_position, column_position]

print()
print(df2.iloc[0]) #first row 
print(df2.iloc[:,1]) # second col



S101    Adam
S102     Eve
S103     Bob
Name: Name, dtype: object

Name     Adam
Marks      78
Grade       B
Name: S101, dtype: object

S101    Adam
S102     Eve
S103     Bob
Name: Name, dtype: object

Name     Adam
Marks      78
Grade       B
Name: S101, dtype: object
S101    78
S102    99
S103    85
Name: Marks, dtype: int64
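
Beyond whole rows and columns, the same indexers can pick a single cell, a block, or rows matching a condition. The sketch below reuses the `df2` table; the threshold of 80 marks is an arbitrary value chosen for illustration.

```python
import pandas as pd

info2 = {
    "Name": ["Adam", "Eve", "Bob"],
    "Marks": [78, 99, 85],
    "Grade": ["B", "O", "A"],
}
df2 = pd.DataFrame(info2, index=["S101", "S102", "S103"])

# One cell: by labels with loc, by positions with iloc
one_cell = df2.loc["S102", "Marks"]
same_cell = df2.iloc[1, 1]

# A block: label slice of rows (both ends included) and a list of columns
block = df2.loc["S101":"S102", ["Name", "Marks"]]

# Rows where a condition holds, keeping only the Name column
top_names = df2.loc[df2["Marks"] > 80, "Name"]

print(one_cell, same_cell)
print(block)
print(top_names)
```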


In [47]:
# Creating a DataFrame from a list of lists (each inner list is one row)
l1 = [["Rohit",96],["Namo",96],["Manya",97]]
df3 = pd.DataFrame(l1,columns=["Name","Marks"])

In [48]:
df3

    Name  Marks
0  Rohit     96
1   Namo     96
2  Manya     97
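
A closely related variation, shown here only as a sketch, is a list of dictionaries. In that case the column names come from the dictionary keys, so the `columns` argument is not needed. The name `records` is illustrative.

```python
import pandas as pd

# Each dictionary is one row; keys become column names
records = [
    {"Name": "Rohit", "Marks": 96},
    {"Name": "Namo", "Marks": 96},
    {"Name": "Manya", "Marks": 97},
]
df_records = pd.DataFrame(records)

print(df_records)
print(list(df_records.columns))
```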


In [51]:
# Creating a DataFrame from a NumPy array
import numpy as np
np_arr = np.array([
    [1,2,3],
    [4,5,6],
    [7,8,9]
]) 
df4 = pd.DataFrame(np_arr,columns=["Col1","Col2","Col3"])
print(df4)

   Col1  Col2  Col3
0     1     2     3
1     4     5     6
2     7     8     9
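
When building from a NumPy array, row labels can be supplied alongside column names, and the data can be converted back to an array afterwards. This is a minimal sketch; the row labels `r1` to `r3` are made up for the example.

```python
import numpy as np
import pandas as pd

np_arr = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])

# Supply both row labels (index) and column labels
df4 = pd.DataFrame(np_arr, index=["r1", "r2", "r3"], columns=["Col1", "Col2", "Col3"])

# sum() adds up each column by default
col_sums = df4.sum()
print(col_sums)

# Convert back to a plain NumPy array
back = df4.to_numpy()
print(back)
```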


#### Using Pandas to read CSV & JSON files


In [56]:
# CSV Data
df = pd.read_csv("employee_data.csv")
print(df,"\n",type(df))

   ID     Name  Age Department  Salary
0   1    Alice   25         HR   55000
1   2      Bob   32         IT   72000
2   3  Charlie   28    Finance   48000
3   4    David   45  Marketing   91000
4   5      Eva   38         IT   65000
5   6    Frank   29    Finance   50000
6   7    Grace   41         HR   82000
7   8   Hannah   26  Marketing   47000
8   9      Ian   35         IT   75000
9  10    Julia   30    Finance   60000 
 <class 'pandas.core.frame.DataFrame'>


In [57]:
# JSON Data 
df = pd.read_json("employee_data.json")
print(df)

   ID     Name  Age Department  Salary
0   1    Alice   25         HR   55000
1   2      Bob   32         IT   72000
2   3  Charlie   28    Finance   48000
3   4    David   45  Marketing   91000
4   5      Eva   38         IT   65000
5   6    Frank   29    Finance   50000
6   7    Grace   41         HR   82000
7   8   Hannah   26  Marketing   47000
8   9      Ian   35         IT   75000
9  10    Julia   30    Finance   60000
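
Because `employee_data.csv` and `employee_data.json` are not included here, the sketch below feeds inline text to the same readers through `io.StringIO`. The first three rows are copied from the output above; the options `usecols` and `index_col` are shown as examples of commonly used `read_csv` parameters, not as something the original cells used.

```python
import io
import pandas as pd

csv_text = """ID,Name,Age,Department,Salary
1,Alice,25,HR,55000
2,Bob,32,IT,72000
3,Charlie,28,Finance,48000
"""

# read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)

# Load only selected columns and use one of them as the row index
salaries = pd.read_csv(io.StringIO(csv_text), usecols=["Name", "Salary"], index_col="Name")
print(salaries)

# read_json works the same way with a list of records
json_text = '[{"ID": 1, "Name": "Alice"}, {"ID": 2, "Name": "Bob"}]'
df_json = pd.read_json(io.StringIO(json_text))
print(df_json)
```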


In [None]:
# Exporting Data using Pandas
# df.to_csv("temp.csv")
# df.to_json("temp2.json")
# index=False means: do NOT write the DataFrame index as a column in the CSV file
df.to_csv("output.csv", index=False)
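
To see what `index=False` changes without writing any files, `to_csv` can be called with no path, in which case it returns the CSV text as a string. This is a small sketch on a two-row table made up for the example.

```python
import io
import pandas as pd

df_small = pd.DataFrame({"Name": ["Alice", "Bob"], "Salary": [55000, 72000]})

# No path given: to_csv returns the CSV content as a string
with_index = df_small.to_csv()
without_index = df_small.to_csv(index=False)

print(with_index)      # header starts with an empty column for the index
print(without_index)   # header contains only the real columns

# Reading the index-free text back gives the original columns
restored = pd.read_csv(io.StringIO(without_index))
print(restored)
```

Exporting with the index and then reading the file back is a common source of an extra `Unnamed: 0` column, which is why `index=False` is often preferred when the index carries no information.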

#### DataFrame Methods

In [86]:
data = {  
'Name': ['Aarav', 'Isha', 'Rohan', 'Sneha', 'Vikram'],  
'Age': [25, 30, 35, 40, 45],  
'City': ['Delhi', 'Mumbai', 'Bangalore', 'Kolkata', 'Chennai']  
}  
df = pd.DataFrame(data)

# Commonly used DataFrame methods
print(df.head()) # Shows the first n rows (default = 5)
print()
print(df.tail(2)) # Shows the last n rows (default = 5)
print()
print(df.sample()) # Shows n random rows (default = 1)
print()
print(df.info()) # Displays column names, non-null counts, data types and memory usage; info() prints directly and returns None, hence the "None" in the output
print()
print(df.describe()) # Shows descriptive statistics for numeric columns
print()
print(df.nunique()) # Counts the distinct values in each column
print()
print()

# Commonly used DataFrame attributes
# Attributes vs. methods:
# df is a DataFrame object. An object has information about itself (attributes, accessed without parentheses)
# and actions it can perform (methods, called with parentheses).
print(df.shape) # Returns (rows, columns)
print(df.columns) # Index object holding the column names
print(df.dtypes) # Data type of each column

     Name  Age       City
0   Aarav   25      Delhi
1    Isha   30     Mumbai
2   Rohan   35  Bangalore
3   Sneha   40    Kolkata
4  Vikram   45    Chennai

     Name  Age     City
3   Sneha   40  Kolkata
4  Vikram   45  Chennai

    Name  Age       City
2  Rohan   35  Bangalore

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    5 non-null      object
 1   Age     5 non-null      int64 
 2   City    5 non-null      object
dtypes: int64(1), object(2)
memory usage: 252.0+ bytes
None

             Age
count   5.000000
mean   35.000000
std     7.905694
min    25.000000
25%    30.000000
50%    35.000000
75%    40.000000
max    45.000000

Name    5
Age     5
City    5
dtype: int64

(5, 3)
Index(['Name', 'Age', 'City'], dtype='object')
Name    object
Age      int64
City    object
dtype: object
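
A few of these methods take arguments worth knowing. The sketch below, on the same data, passes `n` to `head`, fixes `random_state` so that `sample` returns the same rows on every run (the seed value 0 is arbitrary), and reads individual statistics from `describe`.

```python
import pandas as pd

data = {
    "Name": ["Aarav", "Isha", "Rohan", "Sneha", "Vikram"],
    "Age": [25, 30, 35, 40, 45],
    "City": ["Delhi", "Mumbai", "Bangalore", "Kolkata", "Chennai"],
}
df = pd.DataFrame(data)

# First two rows only
first_rows = df.head(2)
print(first_rows)

# Reproducible random sample: the same seed gives the same rows
picked = df.sample(n=2, random_state=0)
picked_again = df.sample(n=2, random_state=0)
print(picked)

# describe() on one column returns a Series of statistics
age_stats = df["Age"].describe()
print(age_stats["mean"], age_stats["std"])

# shape is an attribute (no parentheses); len(df) gives the row count
print(df.shape, len(df))
```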
