# What is Pandas?
---

Pandas is a powerful and popular open-source Python library used for data analysis and data manipulation. It is widely used in Data Science, Machine Learning, AI, Data Analytics, and Finance because it provides fast, flexible, and easy-to-use data structures.

---

**Key Points about Pandas**

---
| Feature                  | Description                                                  |
| ------------------------ | ------------------------------------------------------------ |
| **Data Structures**      | Provides **Series** (1D) and **DataFrame** (2D tabular data) |
| **Handles Missing Data** | Easily manage NA/Null values                                 |
| **Data Cleaning**        | Replace, filter, drop, modify data                           |
| **Data Analysis**        | Perform group, sort, filter, merge, and aggregate            |
| **File Operations**      | Read/Write CSV, Excel, JSON, SQL, etc.                       |
| **Fast Performance**     | Built on top of **NumPy**, optimized for performance         |



---

**📦 How to Install Pandas**
---
        pip install pandas

**How to import**
---
        import pandas as pd

---
# Features of Pandas

- Import Data Sets (CSV, SQL, Excel, etc.)
- Data Cleaning
- Size Mutability (Add / Delete Rows & Columns)
- Reshaping & Pivot Table
- Efficient Manipulation & Extraction
- Statistical Analysis
  
---

# 🏁 Conclusion
---
Pandas is a powerful data manipulation and analysis library that allows:
- ✔ Easy data import
- ✔ Data cleaning and transformation
- ✔ Analytical & statistical operations
- ✔ Handling large datasets efficiently

In [1]:
import pandas as pd
import numpy as np

In [2]:
s = pd.Series([22, 33, 44], name="data")
df = s.to_frame()
df

Unnamed: 0,data
0,22
1,33
2,44


In [3]:
arr = np.arange(1, 16)
df1 = pd.Series(arr)
print(df1)

0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10    11
11    12
12    13
13    14
14    15
dtype: int64


In [4]:
arr[2]

np.int64(3)

In [5]:
df1.iloc[11]

np.int64(12)

In [6]:
df1

0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10    11
11    12
12    13
13    14
14    15
dtype: int64

In [7]:
df1.iloc[1:12]

1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10    11
11    12
dtype: int64
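
The slice above uses `iloc`, which is position-based and excludes the end position, as plain Python slicing does. A minimal sketch contrasting it with label-based `loc`, which includes the end label (the names `s`, `by_position`, and `by_label` are chosen here for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1, 16))  # values 1..15, default labels 0..14

by_position = s.iloc[1:12]  # positions 1..11, end position excluded
by_label = s.loc[1:12]      # labels 1..12, end label included

print(len(by_position))  # 11
print(len(by_label))     # 12
```

With the default integer index the two look alike, which is why the off-by-one difference is easy to miss.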

# 🧾 Main Data Structures in Pandas
1. Series (1-D labeled array)
---

In [8]:
import pandas as pd
s = pd.Series([10, 20, 30, 40])
print(s)


0    10
1    20
2    30
3    40
dtype: int64


2. DataFrame (2-D table like Excel sheet)
---

In [9]:
data = {
    "Name": ["Aman", "Suyash", "Rahul"],
    "Age": [20, 21, 22]
}

df = pd.DataFrame(data)
print(df)


     Name  Age
0    Aman   20
1  Suyash   21
2   Rahul   22


# 🔧 Important Functions in Pandas
- Creating and Viewing Data
---

In [10]:
df.head()       # first 5 rows
df.tail()       # last 5 rows
df.info()       # column info
df.describe()   # summary statistics


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 180.0+ bytes


Unnamed: 0,Age
count,3.0
mean,21.0
std,1.0
min,20.0
25%,20.5
50%,21.0
75%,21.5
max,22.0


In [11]:
# Selecting columns & rows
df['Name']            # select a column (returns a Series)
df.iloc[0]            # select a row by integer position
df.loc[1, 'Age']      # select a single value by row label and column name

np.int64(21)
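
With the default `RangeIndex`, positions and labels coincide, so `iloc[1]` and `loc[1]` return the same row. A short sketch with a non-default index makes the difference visible. It rebuilds the small frame from earlier under the hypothetical name `people` and uses `Name` as the index, which is an assumption made only for illustration:

```python
import pandas as pd

people = pd.DataFrame(
    {"Name": ["Aman", "Suyash", "Rahul"], "Age": [20, 21, 22]}
).set_index("Name")

first_row = people.iloc[0]                # by position: the first row
suyash_age = people.loc["Suyash", "Age"]  # by label: row "Suyash", column "Age"

print(first_row["Age"])  # 20
print(suyash_age)        # 21
```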

In [12]:
#Filtering
df[df['Age'] > 20]

Unnamed: 0,Name,Age
1,Suyash,21
2,Rahul,22


In [13]:
#Sorting
df.sort_values('Age')

Unnamed: 0,Name,Age
0,Aman,20
1,Suyash,21
2,Rahul,22


In [14]:
#Adding and Removing Columns
df['City'] = ['Delhi','Mumbai','Patna']   # add
df.drop('Age', axis=1, inplace=True)      # delete
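
`inplace=True` modifies `df` itself, so re-running the cell fails once `Age` is gone. A sketch of the non-mutating alternative, where `assign` and `drop(columns=...)` each return a new frame. The frame is rebuilt here under the hypothetical name `people` so the block runs on its own:

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Aman", "Suyash", "Rahul"], "Age": [20, 21, 22]})

with_city = people.assign(City=["Delhi", "Mumbai", "Patna"])  # new frame with City added
without_age = with_city.drop(columns="Age")                   # new frame without Age

print(list(people.columns))       # ['Name', 'Age'] (original untouched)
print(list(without_age.columns))  # ['Name', 'City']
```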

In [15]:
# Handling missing values
df.fillna(0)            # return a copy with missing values replaced by 0
df.dropna()             # return a copy without rows that contain missing values

Unnamed: 0,Name,City
0,Aman,Delhi
1,Suyash,Mumbai
2,Rahul,Patna
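
The frame above has no missing values, so both calls return an identical copy. A sketch on a small hypothetical frame (`scores`, with a made-up `Marks` column) that does contain `NaN`:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame(
    {"Name": ["Aman", "Suyash", "Rahul"], "Marks": [81.0, np.nan, 67.0]}
)

missing_per_column = scores.isnull().sum()  # count of missing values per column
filled = scores.fillna({"Marks": 0})        # replace missing Marks with 0
dropped = scores.dropna()                   # drop rows with any missing value

print(missing_per_column)
print(filled)
print(dropped)
```

Neither call changes `scores`; each returns a new frame, so the result has to be assigned to be kept.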


In [16]:
# 📂 File Read/Write
pd.read_csv('file.csv')
df.to_csv('output.csv', index=False)

pd.read_excel('file.xlsx')
df.to_excel('output.xlsx', index=False)

FileNotFoundError: [Errno 2] No such file or directory: 'file.csv'
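
The cell fails because `file.csv` does not exist in the working directory. A self-contained round trip is sketched below, assuming a temporary directory is acceptable for the demo (the frame `people` and the file name `people.csv` are made up). The Excel calls follow the same pattern but need an extra engine such as `openpyxl` installed:

```python
import tempfile
from pathlib import Path

import pandas as pd

people = pd.DataFrame(
    {"Name": ["Aman", "Suyash", "Rahul"], "City": ["Delhi", "Mumbai", "Patna"]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "people.csv"
    people.to_csv(path, index=False)  # index=False keeps the row labels out of the file
    loaded = pd.read_csv(path)        # read the same file back

print(loaded)
```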

In [None]:
# 🤝 Merging & Joining
pd.merge(df1, df2, on='id')

NameError: name 'df2' is not defined
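
The cell fails because `df2` was never defined, and `df1` is a Series with no `id` column. A minimal sketch of `pd.merge` on two hypothetical frames (`employees` and `cities`) that share an `id` column:

```python
import pandas as pd

employees = pd.DataFrame({"id": [1, 2, 3], "Name": ["Aman", "Suyash", "Rahul"]})
cities = pd.DataFrame({"id": [1, 2, 4], "City": ["Delhi", "Mumbai", "Patna"]})

inner = pd.merge(employees, cities, on="id")             # only ids present in both frames
left = pd.merge(employees, cities, on="id", how="left")  # keep every employee row

print(inner)
print(left)
```

The default is an inner join, so `id` 3 and `id` 4 drop out of `inner`; with `how="left"`, the unmatched employee is kept and gets `NaN` for `City`.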

# 🏁 Summary
Pandas is used for:

- Importing and exporting datasets
- Cleaning and transforming data
- Analysing large datasets easily
- Preparing data for machine learning

# 🎯 Why do we use Pandas?
---

**We use Pandas because it helps us:**

**✔ 1. Store and handle data easily**
- Using DataFrame (like an Excel table) and Series (like a single column)

**✔ 2. Clean messy data**
- Remove missing values, duplicates, incorrect formats, etc.

**✔ 3. Analyze data quickly**
- Filter, sort, group, merge, perform statistics, etc.

**✔ 4. Load and save data easily**
- Supports many file formats, such as CSV, Excel, SQL databases, and JSON

**✔ 5. Prepare data for Machine Learning**
- Before training a model, data must be cleaned and structured.


---

# 🧠 Real-life example:
Suppose you have a CSV file of 10,000 students with marks.

**With Pandas you can easily:**
- Load the file in 1 line
- Remove missing values
- Calculate average marks
- Find top students
- Group by class or section

Without Pandas, each of these steps would need hand-written loops or manual spreadsheet work.
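
The steps listed above can be sketched as follows. The data is a tiny made-up stand-in for the CSV (the column names `name`, `section`, and `marks` are assumptions), built inline so the block runs without a file:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the students CSV
students = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "section": ["X", "X", "Y", "Y", "Y"],
    "marks": [72.0, 88.0, np.nan, 95.0, 61.0],
})

clean = students.dropna(subset=["marks"])              # remove rows with missing marks
average = clean["marks"].mean()                        # average marks
top = clean.nlargest(2, "marks")                       # top 2 students by marks
by_section = clean.groupby("section")["marks"].mean()  # average marks per section

print(average)  # 79.0
print(top)
print(by_section)
```

With a real file, the inline frame would be replaced by a single `pd.read_csv(...)` call on that file's path.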


---


# üèÅ Summary üëç
| Feature         | Why it is useful                            |
| --------------- | ------------------------------------------- |
| DataFrame       | Like an Excel table; easy to handle data    |
| Data Cleaning   | Handle nulls, duplicates, and formats       |
| Fast Operations | Works efficiently even on large data        |
| File Handling   | Read/write CSV, Excel, SQL easily           |
| Analysis        | Sorting, filtering, grouping, merging       |
| ML Support      | Preprocess data for machine learning        |


In [17]:
df = pd.read_csv("../Data-Analysis-Datasets/Datasets-Practice/hr_data.csv")
df

Unnamed: 0,employee_id,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,department,salary
0,1003,2,157,3,0,1,0,sales,low
1,1005,5,262,6,0,1,0,sales,medium
2,1486,7,272,4,0,1,0,sales,medium
3,1038,5,223,5,0,1,0,sales,low
4,1057,2,159,3,0,1,0,sales,low
...,...,...,...,...,...,...,...,...,...
14994,87670,2,151,3,0,1,0,support,low
14995,87673,2,160,3,0,1,0,support,low
14996,87679,2,143,3,0,1,0,support,low
14997,87681,6,280,4,0,1,0,support,low


In [22]:
# Show basic preview
preview = df.head()
print(preview)

   employee_id  number_project  average_montly_hours  time_spend_company  \
0         1003               2                   157                   3   
1         1005               5                   262                   6   
2         1486               7                   272                   4   
3         1038               5                   223                   5   
4         1057               2                   159                   3   

   Work_accident  left  promotion_last_5years department  salary  
0              0     1                      0      sales     low  
1              0     1                      0      sales  medium  
2              0     1                      0      sales  medium  
3              0     1                      0      sales     low  
4              0     1                      0      sales     low  


# Now we will go from Beginner → Advanced Pandas functions

We will learn each function and immediately apply it to the HR dataset loaded above.

**📌 Step 1: Basic Pandas Functions**
---
| Category               | Functions                                        |
| ---------------------- | ------------------------------------------------ |
| Basic Info             | `head()`, `tail()`, `shape`, `columns`, `dtypes` |
| Summary & Stats        | `describe()`, `info()`, `value_counts()`         |
| Selection & Filtering  | `loc[]`, `iloc[]`, `query()`                     |
| Handling Missing Data  | `isnull()`, `fillna()`, `dropna()`               |
| Grouping & Aggregation | `groupby()`, `agg()`                             |
| Sorting                | `sort_values()`                                  |
| Merging & Joining      | `merge()`, `concat()`                            |
| Visualization          | `plot()` (later with matplotlib)                 |
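
Of the functions in the table, `concat()` stacks frames on top of each other rather than matching them on a key as `merge()` does. A minimal sketch with two hypothetical frames (`batch_1` and `batch_2`) that have the same columns:

```python
import pandas as pd

batch_1 = pd.DataFrame({"employee_id": [1, 2], "salary": ["low", "high"]})
batch_2 = pd.DataFrame({"employee_id": [3], "salary": ["medium"]})

# ignore_index=True renumbers the rows 0..2 instead of keeping 0, 1, 0
combined = pd.concat([batch_1, batch_2], ignore_index=True)

print(combined)
```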


In [24]:
df.info()
print("-----------------------------------------")

# df.shape
# print("-----------------------------------------")



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   employee_id            14999 non-null  int64 
 1   number_project         14999 non-null  int64 
 2   average_montly_hours   14999 non-null  int64 
 3   time_spend_company     14999 non-null  int64 
 4   Work_accident          14999 non-null  int64 
 5   left                   14999 non-null  int64 
 6   promotion_last_5years  14999 non-null  int64 
 7   department             14999 non-null  object
 8   salary                 14999 non-null  object
dtypes: int64(7), object(2)
memory usage: 1.0+ MB
-----------------------------------------


In [26]:
print(df.describe())
print("-----------------------------------------")

        employee_id  number_project  average_montly_hours  time_spend_company  \
count  14999.000000    14999.000000          14999.000000        14999.000000   
mean   45424.627575        3.803054            201.050337            3.498233   
std    25915.900127        1.232592             49.943099            1.460136   
min     1003.000000        2.000000             96.000000            2.000000   
25%    22872.500000        3.000000            156.000000            3.000000   
50%    45448.000000        4.000000            200.000000            3.000000   
75%    67480.500000        5.000000            245.000000            4.000000   
max    99815.000000        7.000000            310.000000           10.000000   

       Work_accident          left  promotion_last_5years  
count   14999.000000  14999.000000           14999.000000  
mean        0.144610      0.238083               0.021268  
std         0.351719      0.425924               0.144281  
min         0.000000      0.00

In [27]:
print(df.shape)
print("-----------------------------------------")


(14999, 9)
-----------------------------------------


In [29]:
print(df.columns)
print("-----------------------------------------")



Index(['employee_id', 'number_project', 'average_montly_hours',
       'time_spend_company', 'Work_accident', 'left', 'promotion_last_5years',
       'department', 'salary'],
      dtype='object')
-----------------------------------------


In [30]:
print(df.dtypes)
print("-----------------------------------------")

employee_id               int64
number_project            int64
average_montly_hours      int64
time_spend_company        int64
Work_accident             int64
left                      int64
promotion_last_5years     int64
department               object
salary                   object
dtype: object
-----------------------------------------


In [32]:

df['salary'].value_counts()


salary
low       7316
medium    6446
high      1237
Name: count, dtype: int64

In [33]:
df['department'].value_counts()

department
sales          4140
technical      2720
support        2229
IT             1227
product_mng     902
marketing       858
RandD           787
accounting      767
hr              739
management      630
Name: count, dtype: int64
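
`value_counts` can also return shares instead of raw counts via `normalize=True`. A sketch on a small made-up Series, since the HR file is not bundled with this block:

```python
import pandas as pd

salary = pd.Series(["low", "low", "medium", "high"], name="salary")

counts = salary.value_counts()                # raw counts, most frequent first
shares = salary.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```

On the real data the same call would be `df['salary'].value_counts(normalize=True)`.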

In [34]:

df[df['salary'] == 'high']    # high salary employees


Unnamed: 0,employee_id,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,department,salary
72,22316,2,149,3,0,1,0,product_mng,high
111,1698,6,289,4,0,1,0,hr,high
189,2156,2,156,3,0,1,0,technical,high
267,2672,2,129,3,0,1,0,technical,high
306,57209,2,149,3,0,1,0,marketing,high
...,...,...,...,...,...,...,...,...,...
14829,86807,2,148,3,0,1,0,marketing,high
14868,86989,2,130,3,0,1,0,support,high
14902,87131,2,159,3,0,1,0,hr,high
14941,87339,2,131,3,0,1,0,RandD,high


In [35]:
df[df['left'] == 1]           # employees who left company

Unnamed: 0,employee_id,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,department,salary
0,1003,2,157,3,0,1,0,sales,low
1,1005,5,262,6,0,1,0,sales,medium
2,1486,7,272,4,0,1,0,sales,medium
3,1038,5,223,5,0,1,0,sales,low
4,1057,2,159,3,0,1,0,sales,low
...,...,...,...,...,...,...,...,...,...
14994,87670,2,151,3,0,1,0,support,low
14995,87673,2,160,3,0,1,0,support,low
14996,87679,2,143,3,0,1,0,support,low
14997,87681,6,280,4,0,1,0,support,low
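
Conditions can be combined with `&` and `|` (each condition wrapped in parentheses), or written as a string for `query()`. A sketch on a hypothetical three-row frame named `hr` that reuses the dataset's column names:

```python
import pandas as pd

hr = pd.DataFrame({
    "salary": ["high", "low", "high"],
    "left": [1, 1, 0],
    "department": ["hr", "sales", "IT"],
})

# Boolean mask: high salary AND left the company
mask_result = hr[(hr["salary"] == "high") & (hr["left"] == 1)]

# The same filter expressed as a query string
query_result = hr.query("salary == 'high' and left == 1")

print(query_result)
```

The parentheses matter because `&` binds more tightly than `==` in Python.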


In [36]:
df.groupby('department')['left'].mean()

department
IT             0.222494
RandD          0.153748
accounting     0.265971
hr             0.290934
management     0.144444
marketing      0.236597
product_mng    0.219512
sales          0.244928
support        0.248991
technical      0.256250
Name: left, dtype: float64

In [37]:
df.groupby('salary')['average_montly_hours'].mean()

salary
high      199.867421
low       200.996583
medium    201.338349
Name: average_montly_hours, dtype: float64
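
`agg()` computes several statistics per group in one call. A sketch using named aggregation on a small made-up frame with the dataset's column names (the output column names `mean_hours`, `max_hours`, and `attrition_rate` are chosen here for illustration):

```python
import pandas as pd

hr = pd.DataFrame({
    "department": ["sales", "sales", "hr", "hr"],
    "average_montly_hours": [150, 250, 200, 300],
    "left": [1, 0, 1, 1],
})

summary = hr.groupby("department").agg(
    mean_hours=("average_montly_hours", "mean"),
    max_hours=("average_montly_hours", "max"),
    attrition_rate=("left", "mean"),  # mean of a 0/1 column is the share of 1s
)

print(summary)
```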

In [38]:
# sorting

df.sort_values(by='average_montly_hours', ascending=False).head()


Unnamed: 0,employee_id,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,department,salary
14975,87560,7,310,4,0,1,0,hr,medium
14972,87539,6,310,4,0,1,0,accounting,medium
1059,7079,2,310,3,0,1,0,product_mng,low
809,5641,7,310,4,0,1,0,support,medium
803,5593,6,310,4,0,1,0,technical,medium
