# What is pandas?

- Pandas is an open-source Python library widely used for data manipulation, data analysis, and data visualization tasks. 
- It provides data structures like Series (1-dimensional labeled array) and DataFrame (2-dimensional labeled data table) that make it easy to work with structured data. 
- Pandas is built on top of the NumPy library and integrates well with other libraries in the Python data ecosystem.

<br>



# What is Data Cleaning?

- Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying and correcting errors, inconsistencies, and inaccuracies in datasets to ensure data quality and reliability. 
- Data cleaning is crucial because real-world data often contains missing values, duplicate entries, incorrect data types, and outliers, which can negatively impact analysis and modeling.

## INSTALLING PANDAS :
- Use the following command in your terminal ⇨
---
`pip install pandas`
    
---

- Once installed, you can use pandas in your code by importing it with the `import` keyword.
---
`import pandas as pd`

---

---
## SERIES in pandas :
- A Pandas Series is like a column in a table.

- It is a one-dimensional array holding data of any type.
- Each element in the Series has a label; together these labels form the index, which allows for fast and flexible data manipulation.

In [72]:
import pandas as pd
import numpy as np

# Creating a Series from a list
data = [10, 20, 30, 40, 50]
my_series = pd.Series(data)

my_series


0    10
1    20
2    30
3    40
4    50
dtype: int64

In [4]:
data = [10, 20, 30, 40, 50]
index = ['maths','chem','phy','comp','english']
marks = pd.Series(data,index,name="marks")
marks

maths      10
chem       20
phy        30
comp       40
english    50
Name: marks, dtype: int64

In [5]:
d = {'maths':20,'phys':50}
s = pd.Series(d)
s

maths    20
phys     50
dtype: int64

In [7]:
# Select the first and last elements by position (use .iloc when the index holds labels)
marks.iloc[[0, -1]]

maths      10
english    50
Name: marks, dtype: int64
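
The difference between label-based and position-based access can be sketched as follows. This is a minimal illustration that rebuilds the `marks` Series from above; the variable names `chem_mark`, `first_last`, and `maths_to_phy` are only for illustration.

```python
import pandas as pd

marks = pd.Series([10, 20, 30, 40, 50],
                  index=['maths', 'chem', 'phy', 'comp', 'english'],
                  name='marks')

# Label-based access with .loc
chem_mark = marks.loc['chem']

# Position-based access with .iloc (first and last element)
first_last = marks.iloc[[0, -1]]

# Label-based slices include the end label
maths_to_phy = marks.loc['maths':'phy']

print(chem_mark)
print(first_last)
print(maths_to_phy)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether a number means a label or a position.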

In [52]:
# Note: read_csv returns a DataFrame (not a Series), despite the variable name
my_series = pd.read_csv('Data/gpa_study_hours.csv')
my_series

Unnamed: 0,gpa,study_hours
0,4.00,10.0
1,3.80,25.0
2,3.93,45.0
3,3.40,10.0
4,3.20,4.0
...,...,...
188,3.60,24.0
189,3.70,12.0
190,3.84,15.0
191,3.80,10.0


In [9]:
my_series.set_index(my_series['study_hours'],inplace=True)

In [18]:
my_series.rename(columns={'study_hours': 'n'}, inplace=True)

In [19]:
my_series.drop(columns=['n'], inplace=True)

In [20]:
my_series

study_hours,gpa
10.0,4.00
25.0,3.80
45.0,3.93
10.0,3.40
4.0,3.20
...,...
24.0,3.60
12.0,3.70
15.0,3.84
10.0,3.80


In [33]:
my_series.sort_index(ascending=True)

study_hours,gpa
2.0,3.860
3.0,3.400
3.0,3.500
4.0,3.600
4.0,3.400
...,...
49.0,3.500
60.0,3.830
60.0,3.830
60.0,3.825


In [25]:
my_series.dtypes

gpa    float64
dtype: object

In [26]:
type(my_series)

pandas.core.frame.DataFrame

In [28]:
my_series.describe()

Unnamed: 0,gpa
count,193.0
mean,3.586166
std,0.285482
min,2.6
25%,3.4
50%,3.62
75%,3.8
max,4.3


In [32]:
my_series.tail()

study_hours,gpa
24.0,3.6
12.0,3.7
15.0,3.84
10.0,3.8
15.0,3.1


---
## DATAFRAME in pandas :
- A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

In [34]:
import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Tokyo']
}
df = pd.DataFrame(data)

df


Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,London
2,Charlie,35,Tokyo


## Reading and Writing Data:

- Pandas can read and write data from/to various file formats, including CSV, Excel, SQL databases, and more.

In [None]:
# Reading data from a CSV file
df = pd.read_csv('data.csv')

# Writing data to a CSV file
df.to_csv('output.csv', index=False)
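
The effect of `index=False` can be seen with a small round trip. The sketch below writes to an in-memory `io.StringIO` buffer rather than a real file (an assumption made here so that it runs without touching the disk); the data is made up for illustration.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Write without the index, then read the same CSV text back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# Write WITH the index (the default); reading it back adds an 'Unnamed: 0' column
buffer_with_index = io.StringIO()
df.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)

print(round_trip.columns.tolist())
print(with_index.columns.tolist())
```

If a file was saved with its index, passing `index_col=0` to `read_csv` reads that first column back as the index instead of as a data column.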


## Data Selection and Filtering:
- Pandas allows you to select specific data from a DataFrame using various methods like indexing, slicing, and boolean indexing.

In [35]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Tokyo']
}
df = pd.DataFrame(data)

df

Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,London
2,Charlie,35,Tokyo


In [37]:
new_df = pd.read_csv("Data/sample.csv")

In [38]:
new_df

Unnamed: 0,name,sex,age,height,weight
0,Aubrey,M,41.0,74,170
1,Ron,M,42.0,68,166
2,Carl,M,32.0,70,155
3,Antonio,M,39.0,72,167
4,Deborah,F,30.0,66,124
5,Jacqueline,F,33.0,66,115
6,Helen,F,26.0,64,121
7,David,M,30.0,71,158
8,James,M,53.0,72,175
9,Michael,M,32.0,69,143


In [44]:
new_df[10:][['name','age']]

Unnamed: 0,name,age
10,Ruth,47.0
11,Joel,34.0
12,Donna,23.0
13,Roger,36.0
14,Yao,
15,Elizabeth,31.0
16,Tim,29.0
17,Susan,28.0


In [47]:
new_df[new_df['age'] > 30]

Unnamed: 0,name,sex,age,height,weight
0,Aubrey,M,41.0,74,170
1,Ron,M,42.0,68,166
2,Carl,M,32.0,70,155
3,Antonio,M,39.0,72,167
5,Jacqueline,F,33.0,66,115
8,James,M,53.0,72,175
9,Michael,M,32.0,69,143
10,Ruth,F,47.0,69,139
11,Joel,M,34.0,72,163
13,Roger,M,36.0,75,160


In [41]:
# Select several columns by passing a list of column names
new_df[['name', 'age']].head()

Unnamed: 0,name,age
0,Aubrey,41.0
1,Ron,42.0
2,Carl,32.0
3,Antonio,39.0
4,Deborah,30.0


In [36]:
# Selecting a column
df['Name']

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

In [None]:
# Slicing rows
df[1:]

In [None]:
# Boolean indexing
df[df['Age'] > 30]
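
Several selection styles can be combined. The following is a minimal sketch using the same small `Name`/`Age`/`City` table; the result variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Tokyo'],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
over_25_not_tokyo = df[(df['Age'] > 25) & (df['City'] != 'Tokyo')]

# .loc selects rows by condition and columns by label in one step
names_over_25 = df.loc[df['Age'] > 25, 'Name']

# .iloc selects by integer position: first two rows, first two columns
top_left = df.iloc[:2, :2]

print(over_25_not_tokyo)
print(names_over_25)
print(top_left)
```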

---
## INDEXES :
- In pandas, an index holds the row labels of a DataFrame. It serves as a key for data alignment, selection, and retrieval. Labels are often unique, but pandas does not require them to be.
- By default, when you create a DataFrame, pandas assigns a numeric index starting from 0 to each row.

- However, you can set custom labels as the index, which can be strings, dates, or any other hashable type.

---
**METHODS ⇨**
- You can set a specific column as the index during the DataFrame creation or afterward using the `set_index()` method.
- `reset_index()`: Reset the DataFrame index.
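
Index labels do not have to be unique. As a minimal sketch with made-up `study_hours` and `gpa` values, the example below sets a column with repeated values as the index, looks up a repeated label, and then restores the default index.

```python
import pandas as pd

df = pd.DataFrame({'study_hours': [10, 25, 10], 'gpa': [4.0, 3.8, 3.4]})

# Without inplace=True, set_index returns a new DataFrame and leaves df unchanged
indexed = df.set_index('study_hours')
print(indexed.index.is_unique)

# Looking up a repeated label returns every matching row
rows_10 = indexed.loc[10]
print(rows_10)

# reset_index moves the index back into a regular column
restored = indexed.reset_index()
print(restored)
```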

In [50]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Tokyo']
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,London
2,Charlie,35,Tokyo


In [51]:
df.set_index('Name', inplace=True)

df

Name,Age,City
Alice,25,New York
Bob,30,London
Charlie,35,Tokyo


In [53]:
my_series.head()

Unnamed: 0,gpa,study_hours
0,4.0,10.0
1,3.8,25.0
2,3.93,45.0
3,3.4,10.0
4,3.2,4.0


In [54]:
my_series.set_index('study_hours',inplace=True)
my_series.head()

study_hours,gpa
10.0,4.0
25.0,3.8
45.0,3.93
10.0,3.4
4.0,3.2


In [55]:
df.reset_index(inplace=True)
df

Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,London
2,Charlie,35,Tokyo


In [56]:
new_df.head()

Unnamed: 0,name,sex,age,height,weight
0,Aubrey,M,41.0,74,170
1,Ron,M,42.0,68,166
2,Carl,M,32.0,70,155
3,Antonio,M,39.0,72,167
4,Deborah,F,30.0,66,124


In [70]:
new_df[new_df['height'] > 70].loc[:,'name']

0      Aubrey
3     Antonio
7       David
8       James
11       Joel
13      Roger
16        Tim
Name: name, dtype: object

In [62]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    18 non-null     object 
 1   sex     18 non-null     object 
 2   age     17 non-null     float64
 3   height  18 non-null     int64  
 4   weight  18 non-null     int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 848.0+ bytes


In [97]:
new_df.columns

Index(['name', 'sex', 'age', 'height', 'weight'], dtype='object')

In [98]:
new_df.describe()

Unnamed: 0,age,height,weight
count,17.0,18.0,18.0
mean,34.470588,69.055556,146.722222
std,7.763035,3.52257,22.540958
min,23.0,62.0,98.0
25%,30.0,66.25,132.0
50%,32.0,69.5,150.0
75%,39.0,71.75,165.25
max,53.0,75.0,176.0


In [59]:
# Rows at positions 1 and 2, first column
new_df.iloc[1:3, 0]

1     Ron
2    Carl
Name: name, dtype: object

In [61]:
new_df.loc[:,'name']

0         Aubrey
1            Ron
2           Carl
3        Antonio
4        Deborah
5     Jacqueline
6          Helen
7          David
8          James
9        Michael
10          Ruth
11          Joel
12         Donna
13         Roger
14           Yao
15     Elizabeth
16           Tim
17         Susan
Name: name, dtype: object

In [None]:
# Accessing data with index labels:

# 'Name' must be the index for a label lookup, so set it first
df.set_index('Name').loc['Bob']

In [71]:
new_df['age'].unique()

array([41., 42., 32., 39., 30., 33., 26., 53., 47., 34., 23., 36., nan,
       31., 29., 28.])

In [83]:
new_df[new_df['age'].isnull()]

Unnamed: 0,name,sex,age,height,weight
14,Yao,M,,70,145


In [75]:
# Comparing with the string 'nan' does not detect missing values; use isnull() instead
new_df['age'] == 'nan'

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
Name: age, dtype: bool

---

## Handling Missing Values:

- `isnull() / notnull()`: Detect missing or non-missing values in the DataFrame.
- `dropna()`: Remove rows with missing values.
- `fillna(value)`: Fill missing values with a specified value.
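
A few variations on these methods, as a sketch on made-up data: `notnull()` used as a filter, `dropna()` restricted to one column through its `subset` parameter, and `fillna()` given a dictionary so that each column gets its own fill value. The placeholder `'Unknown'` is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', None, 'David'],
                   'Age': [25, 30, None, None]})

# Keep only rows where Name is present
has_name = df[df['Name'].notnull()]

# Drop rows only when Name is missing; a missing Age alone is tolerated
dropped = df.dropna(subset=['Name'])

# Fill each column with its own value: a placeholder for Name, the mean for Age
filled = df.fillna({'Name': 'Unknown', 'Age': df['Age'].mean()})

print(has_name)
print(dropped)
print(filled)
```

Whether to drop or fill depends on the analysis; filling with the mean keeps the row count but reduces the variance of the column.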


In [86]:
new_df.isnull().sum()

name      0
sex       0
age       1
height    0
weight    0
dtype: int64

In [93]:
import pandas as pd

data = {'Name': ['Alice', 'Bob', None, 'David'],
        'Age': [25, 30, None, 22]}
df = pd.DataFrame(data)

df.isnull()  # Check for missing values


Unnamed: 0,Name,Age
0,False,False
1,False,False
2,True,True
3,False,False


In [94]:
df

Unnamed: 0,Name,Age
0,Alice,25.0
1,Bob,30.0
2,,
3,David,22.0


In [89]:
cleaned_df = df.dropna()
cleaned_df

Unnamed: 0,Name,Age
0,Alice,25.0
1,Bob,30.0
3,David,22.0


In [96]:
df['Age'].mean()

25.666666666666668

In [95]:
# Assign the result back; inplace=True on a selected column is unreliable in recent pandas versions
df['Age'] = df['Age'].fillna(df['Age'].mean())
df

Unnamed: 0,Name,Age
0,Alice,25.0
1,Bob,30.0
2,,25.666667
3,David,22.0


---
## Handling Duplicates:
- `duplicated()`: Detect duplicate rows.
- `drop_duplicates()`: Remove duplicate rows.

In [None]:
data = {'Name': ['Alice', 'Bob', 'Alice', 'David'],
        'Age': [25, 30, 25, 22]}
df = pd.DataFrame(data)

print(df.duplicated())  # Check for duplicate rows

In [None]:
cleaned_df = df.drop_duplicates()
cleaned_df
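
By default both methods compare entire rows. As a sketch on made-up data, the `subset` and `keep` parameters can restrict the comparison to certain columns and choose which occurrence survives.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David'],
                   'Age': [25, 30, 26, 22]})

# No two rows are fully identical here, so nothing is flagged
full_row_dupes = df.duplicated()

# Compare on Name only: the second 'Alice' is flagged
name_dupes = df.duplicated(subset=['Name'])

# Keep the last occurrence of each Name instead of the first
keep_last = df.drop_duplicates(subset=['Name'], keep='last')

print(full_row_dupes)
print(name_dupes)
print(keep_last)
```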

---
## Data Type Conversion:

- `astype()`: Convert data types of DataFrame columns.

In [None]:
df['Age'] = df['Age'].astype(str)
df.dtypes
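
When a column that should be numeric contains stray text, `astype(int)` raises an error. One possible approach, sketched below on made-up data, is `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries into missing values.

```python
import pandas as pd

df = pd.DataFrame({'Age': ['25', '30', 'unknown', '22']})

# Entries that cannot be parsed as numbers become NaN instead of raising an error
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

print(df)
print(df['Age'].isnull().sum())
```

The resulting missing values can then be handled with `dropna()` or `fillna()`.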

---

## Data Transformation | Handling Categorical Data:

- `map()`: Replace values based on a mapping dictionary.
- `get_dummies()`: Create dummy variables for categorical columns.
- `factorize()`: Encode categorical columns with numeric labels.

In [None]:
data = {'Grade': ['A', 'B', 'C', 'A', 'B', 'A']}
df = pd.DataFrame(data)

grade_mapping = {'A': 'Excellent', 'B': 'Good', 'C': 'Average'}
df['Grade'] = df['Grade'].map(grade_mapping)
df

In [None]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Gender': ['Female', 'Male', 'Male', 'Male']}
df = pd.DataFrame(data)

df = pd.get_dummies(df, columns=['Gender'])
df


In [None]:
data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C']}
df = pd.DataFrame(data)

df['Category'] = df['Category'].factorize()[0]
df
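
Two details of these methods are easy to miss. In the sketch below (made-up data), `map()` yields a missing value for any entry that has no key in the dictionary, and `pd.factorize()` returns both the numeric codes and the unique values they stand for.

```python
import pandas as pd

grades = pd.Series(['A', 'B', 'D'])

# 'D' has no entry in the mapping, so it becomes NaN
mapped = grades.map({'A': 'Excellent', 'B': 'Good'})

# factorize returns a pair: integer codes and the unique values in order of appearance
codes, uniques = pd.factorize(pd.Series(['A', 'B', 'C', 'A']))

print(mapped)
print(codes)
print(uniques)
```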

## Handling Text Data :

- `str.strip(), str.lstrip(), str.rstrip()`: Remove leading and/or trailing whitespace from string columns.
- `replace()` / `str.replace()`: Replace whole values (`replace()`) or specific substrings (`str.replace()`) in string columns.
- `str.lower() / str.upper()`: Convert text to lowercase or uppercase.
- `str.extract()`: Extract specific patterns from text columns using regular expressions.

In [None]:
data = {'Name': ['   Alice  ', '   Bob   ', 'David     '],
        'Age': [25, 30, 22]}
df = pd.DataFrame(data)

df['Name'] = df['Name'].str.strip()
df

In [None]:
df['Name'] = df['Name'].replace('Alice', 'Alex')
df
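
The cell above replaces whole values. To illustrate how that differs from substring replacement, here is a small sketch on made-up names; `regex=False` is passed so the pattern is treated as plain text.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice Smith', 'Bob', 'Alice']})

# Series.replace matches whole values: only the exact value 'Alice' changes
whole_value = df['Name'].replace('Alice', 'Alex')

# Series.str.replace works inside each string: every occurrence of 'Alice' changes
substring = df['Name'].str.replace('Alice', 'Alex', regex=False)

print(whole_value)
print(substring)
```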

In [None]:
data = {'Name': ['Alice', 'Bob', 'Charlie']}
df = pd.DataFrame(data)

df['Name'] = df['Name'].str.lower()
df

In [None]:
data = {'Description': ['Product A is great', 'Product B is awesome', 'Product C is amazing']}
df = pd.DataFrame(data)

df['Product'] = df['Description'].str.extract(r'Product ([A-Z])')
df


---
## Data Filtering:

- Use boolean indexing to filter rows based on certain conditions.

In [None]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 22]}
df = pd.DataFrame(data)

filtered_df = df[df['Age'] > 25]
filtered_df

## Renaming Columns:

- `rename()`: Rename columns in the DataFrame.

In [None]:
df.rename(columns={'Name': 'Full Name'}, inplace=True)
df

## Dropping Columns:

- `drop()`: Remove specific columns from the DataFrame.

In [None]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 22]}
df = pd.DataFrame(data)

df.drop(columns=['Age'], inplace=True)
df

## Handling Datetime Data:
- `to_datetime()`: Convert columns to datetime objects.
- `dt`: Access various components of the datetime column (e.g., year, month, day).

In [None]:
data = {'Date': ['2023-07-15', '2023-07-16', '2023-07-17'],
        'Temperature': [25, 28, 30]}
df = pd.DataFrame(data)

df['Date'] = pd.to_datetime(df['Date'])
print(df.dtypes)

In [None]:
df['Year'] = df['Date'].dt.year
df
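
Real date columns often contain entries that cannot be parsed. As a sketch on made-up data, `errors='coerce'` turns such entries into `NaT` (the datetime equivalent of a missing value), and the `dt` accessor exposes further components such as the month and the weekday name.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2023-07-15', '2023-07-16', 'not a date']})

# Unparseable strings become NaT instead of raising an error
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Components are missing wherever the date itself is missing
df['Month'] = df['Date'].dt.month
df['Weekday'] = df['Date'].dt.day_name()

print(df)
```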

---
## Handling Numeric Data:
- `round()`: Round numeric values to a specified number of decimal places.
- `clip()`: Limit numeric values within a specific range.

In [None]:
data = {'Value': [3.1456, 6.789, 2.345]}
df = pd.DataFrame(data)

df['Value'] = df['Value'].round(2)
df

In [None]:
df['Value'] = df['Value'].clip(lower=4, upper=6)
df

---
## SORTING data:
- `sort_values()` is used to sort the DataFrame based on the values of one or more columns. By default, it sorts the DataFrame in ascending order, but you can specify `ascending=False` to sort in descending order.
- `sort_index()` is used to sort the DataFrame based on its index (row labels).




In [None]:
import pandas as pd

# Create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 20, 30, 22],
    'Salary': [50000, 40000, 60000, 45000]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
df

In [None]:
# Sort the DataFrame by 'Age' in ascending order
df_sorted_age = df.sort_values(by='Age')

print("\nDataFrame sorted by Age (ascending):")
df_sorted_age

In [None]:
# Sort the DataFrame by 'Salary' in descending order
df_sorted_salary = df.sort_values(by='Salary', ascending=False)

print("\nDataFrame sorted by Salary (descending):")
df_sorted_salary

In [None]:
import pandas as pd

# Create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 20, 30],
    'Salary': [50000, 40000, 60000]
}
df = pd.DataFrame(data)

# Change the index of the DataFrame
df.index = ['c', 'a', 'b']

print("Original DataFrame:")
print(df)

# Sort the DataFrame by index in ascending order
df_sorted_index = df.sort_index()

print("\nDataFrame sorted by index (ascending):")
print(df_sorted_index)

# Sort the DataFrame by index in descending order
df_sorted_index_desc = df.sort_index(ascending=False)

print("\nDataFrame sorted by index (descending):")
print(df_sorted_index_desc)
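
`sort_values()` also accepts a list of columns, with one sort direction per column. The sketch below adds a hypothetical `Dept` column to illustrate sorting by department first and then by salary in descending order within each department.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Dept': ['HR', 'IT', 'HR', 'IT'],       # hypothetical column for illustration
    'Salary': [50000, 40000, 60000, 45000],
})

# Dept ascending, then Salary descending within each Dept
by_dept_salary = df.sort_values(by=['Dept', 'Salary'], ascending=[True, False])

# sort_values returns a new DataFrame; the original row order is untouched
print(by_dept_salary)
print(df)
```

The sorted result keeps the original index labels; chaining `.reset_index(drop=True)` renumbers the rows from 0.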


## Unique Values :
- `nunique()` counts the number of distinct elements. On a Series it returns a single count; on a DataFrame it returns the count of distinct values in each column (or in each row with `axis=1`). Missing values are excluded by default.
- `value_counts()` counts the occurrences of each unique value in a Series. It is a convenient way to get a frequency distribution of the values in the Series.

In [None]:
data = [1, 2, 3, 2, 1, 4, 3, 5, 5]
my_series = pd.Series(data)

# Count the number of unique elements in the Series
num_unique_elements = my_series.nunique()
num_unique_elements

In [None]:
data = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple', 'banana', 'kiwi', 'kiwi']
fruits_series = pd.Series(data)

# Get the frequency count of each unique value
fruits_series.value_counts()
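
On a DataFrame, `nunique()` works column by column, and `value_counts()` can report proportions instead of raw counts through its `normalize` parameter. A minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'banana', 'apple', 'kiwi'],
                   'color': ['red', 'yellow', 'green', 'green']})

# One distinct-value count per column
per_column = df.nunique()

# Share of each value instead of its count
shares = df['fruit'].value_counts(normalize=True)

print(per_column)
print(shares)
```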

In [None]:
data = {'name': ['Aubrey', 'Ron', 'Carl', 'Antonio', 'Deborah', 'Jacqueline', 'Helen', 'David', 'James', 'Michael', 'Ruth', 'Joel', 'Donna', 'Roger', 'Yao', 'Elizabeth', 'Tim', 'Susan'],
        'sex': ['M', 'M', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'F'],
        'age': [41, 42, 32, 39, 30, 33, 26, 30, 53, 32, 47, 34, 23, 36, None, 31, 29, 28],
        'height': [74, 68, 70, 72, 66, 66, 64, 71, 72, 69, 69, 72, 62, 75, 70, 67, 71, 65],
        'weight': [170, 166, 155, 167, 124, 115, 121, 158, 175, 143, 139, 163, 98, 160, 145, 135, 176, 131]}

df = pd.DataFrame(data)

In [None]:
df.to_csv("Data/sample.csv",index=False)