# What is pandas?

- Pandas is an open-source Python library widely used for data manipulation, data analysis, and data visualization tasks. 
- It provides data structures like Series (1-dimensional labeled array) and DataFrame (2-dimensional labeled data table) that make it easy to work with structured data. 
- Pandas is built on top of the NumPy library and integrates well with other libraries in the Python data ecosystem.

<br>



# What is Data Cleaning?

- Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying and correcting errors, inconsistencies, and inaccuracies in datasets to ensure data quality and reliability. 
- Data cleaning is crucial because real-world data often contains missing values, duplicate entries, incorrect data types, and outliers, which can negatively impact analysis and modeling.
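
As a rough illustration of the problems listed above, the sketch below builds a small, made-up DataFrame (the `raw` name and its values are assumptions for demonstration only) and counts missing values and duplicate rows. It also prints the column types, which shows a numeric column that was stored as text.

```python
import pandas as pd

# Hypothetical toy data: one missing name, one repeated row,
# and ages stored as strings instead of numbers.
raw = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob', None],
    'Age': ['25', '30', '30', '22'],
})

missing_per_column = raw.isna().sum()    # number of missing values in each column
duplicate_rows = raw.duplicated().sum()  # rows identical to an earlier row

print(missing_per_column)
print(duplicate_rows)
print(raw.dtypes)  # 'Age' is not numeric because its values are strings
```

Checks like these are usually a reasonable first step before deciding which cleaning methods to apply.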

## INSTALLING PANDAS :
- Use the following command in your terminal ⇨
---
`pip install pandas`
    
---

- Once installed, you can use pandas in your code by importing it with the `import` keyword.
---
`import pandas as pd`

---

---
## SERIES in pandas :
- A Pandas Series is like a column in a table.

- It is a one-dimensional array holding data of any type.
- Each element in the Series has a label called an index, which allows for fast and flexible data manipulation.

In [49]:
import pandas as pd

# Creating a Series from a list
data = [10, 20, 30, 40, 50]
my_series = pd.Series(data)

my_series


0    10
1    20
2    30
3    40
4    50
dtype: int64
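
The index does not have to be the default 0, 1, 2, and so on. As a minimal sketch, the labels `'a'`, `'b'`, `'c'` below are arbitrary choices for illustration; the example shows looking up a value by label and by position.

```python
import pandas as pd

# Series with custom string labels instead of the default integer index
scores = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = scores['b']        # look up by index label
by_position = scores.iloc[0]  # look up by integer position

print(by_label, by_position)
```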

---
## DATAFRAME in pandas :
- A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

In [51]:
import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Tokyo']
}
df = pd.DataFrame(data)

df


Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,London
2,Charlie,35,Tokyo


## Reading and Writing Data:

- Pandas can read and write data from/to various file formats, including CSV, Excel, SQL databases, and more.

In [None]:
# Reading data from a CSV file
df = pd.read_csv('data.csv')

# Writing data to a CSV file
df.to_csv('output.csv', index=False)
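
The cell above assumes that a file named `data.csv` exists. To try the same round trip without any file on disk, one option is an in-memory text buffer, as sketched below; the small DataFrame is made up for illustration.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Write CSV text into an in-memory buffer instead of a file
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

Passing `index=False` keeps the row index out of the CSV, so reading it back does not produce an extra unnamed column.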


## Data Selection and Filtering:
- Pandas allows you to select specific data from a DataFrame using various methods like indexing, slicing, and boolean indexing.

In [55]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Tokyo']
}
df = pd.DataFrame(data)

df

Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,London
2,Charlie,35,Tokyo


In [56]:
# Selecting a column
df['Name']

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

In [57]:
# Slicing rows
df[1:]

Unnamed: 0,Name,Age,City
1,Bob,30,London
2,Charlie,35,Tokyo


In [58]:
# Boolean indexing
df[df['Age'] > 30]

Unnamed: 0,Name,Age,City
2,Charlie,35,Tokyo
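
Beyond the basic forms above, pandas also offers `loc` (label-based) and `iloc` (position-based) selection, and boolean conditions can be combined. The following is a small sketch on the same sample data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Tokyo'],
})

# loc: filter rows with a condition and pick columns by name
older = df.loc[df['Age'] >= 30, ['Name', 'City']]

# iloc: pick rows and columns by integer position
top_left = df.iloc[:2, :2]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
match = df[(df['Age'] > 25) & (df['City'] != 'Tokyo')]

print(older)
print(top_left)
print(match)
```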


---
## INDEXES :
- In pandas, an index holds the row labels of a DataFrame. It serves as a key for data alignment, selection, and retrieval. Labels are usually unique, although pandas does not enforce this.
- By default, when you create a DataFrame, pandas assigns a numeric index starting from 0 to each row.

- However, you can set custom labels as the index, which can be strings, dates, or any other hashable type.

---
**METHODS ⇨**
- You can set a specific column as the index during the DataFrame creation or afterward using the `set_index()` method.
- `reset_index()`: Reset the DataFrame index.

In [60]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Tokyo']
}

df = pd.DataFrame(data)
df.set_index('Name', inplace=True)

df


Name,Age,City
Alice,25,New York
Bob,30,London
Charlie,35,Tokyo


In [61]:
df.reset_index(inplace=True)
df

Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,London
2,Charlie,35,Tokyo


In [64]:
# Accessing Data with Index Labels:

# df.loc looks rows up by index label, so make 'Name' the index again first
df.set_index('Name', inplace=True)
df.loc['Bob']

Age         30
City    London
Name: Bob, dtype: object
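
Index labels can also be dates. As a hedged sketch, the `readings` DataFrame below is invented for illustration; it uses a datetime index to look up one day and to slice a range of days.

```python
import pandas as pd

# Hypothetical temperature readings labeled by date
readings = pd.DataFrame(
    {'Temperature': [25, 28, 30]},
    index=pd.to_datetime(['2023-07-15', '2023-07-16', '2023-07-17']),
)

# Look up a single value by its date label
one_day = readings.loc[pd.Timestamp('2023-07-16'), 'Temperature']

# Label slices include both endpoints
date_range = readings.loc['2023-07-15':'2023-07-16']

print(one_day)
print(date_range)
```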

---

## Handling Missing Values:

- `isnull() / notnull()`: Detect missing or non-missing values in the DataFrame.
- `dropna()`: Remove rows with missing values.
- `fillna(value)`: Fill missing values with a specified value.


In [10]:
import pandas as pd

data = {'Name': ['Alice', 'Bob', None, 'David'],
        'Age': [25, 30, None, 22]}
df = pd.DataFrame(data)

df.isnull()  # Check for missing values


Unnamed: 0,Name,Age
0,False,False
1,False,False
2,True,True
3,False,False


In [9]:
cleaned_df = df.dropna()
cleaned_df

Unnamed: 0,Name,Age
0,Alice,25.0
1,Bob,30.0
3,David,22.0


In [8]:
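# Assign the result back; calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas versions and may not update df
df['Age'] = df['Age'].fillna(0)
df

Unnamed: 0,Name,Age
0,Alice,25.0
1,Bob,30.0
2,None,0.0
3,David,22.0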
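
Dropping every incomplete row or filling with a constant is not always the right choice. The sketch below shows two common variations, assuming the same sample data: dropping rows only when a particular column is missing, and filling a numeric column with its mean. Whether the mean is a sensible fill value depends on the data.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, 30, None, 22],
})

# Drop rows only when 'Name' is missing
named = df.dropna(subset=['Name'])

# Fill missing ages with the column mean (mean() skips NaN by default)
mean_age = df['Age'].mean()
filled = df.assign(Age=df['Age'].fillna(mean_age))

print(named)
print(filled)
```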


---
## Handling Duplicates:
- `duplicated()`: Detect duplicate rows.
- `drop_duplicates()`: Remove duplicate rows.

In [5]:
data = {'Name': ['Alice', 'Bob', 'Alice', 'David'],
        'Age': [25, 30, 25, 22]}
df = pd.DataFrame(data)

print(df.duplicated())  # Check for duplicate rows

0    False
1    False
2     True
3    False
dtype: bool


In [7]:
cleaned_df = df.drop_duplicates()
cleaned_df

Unnamed: 0,Name,Age
0,Alice,25
1,Bob,30
3,David,22
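
By default, a row counts as a duplicate only if every column matches. As a sketch with slightly altered sample data (the second Alice has a different age, which is an assumption made for illustration), `subset` restricts the comparison to chosen columns and `keep` controls which occurrence survives.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 26, 22],
})

# No row is a full duplicate, because the two Alice rows differ in Age
full = df.drop_duplicates()

# Compare on 'Name' only and keep the last occurrence of each name
by_name_last = df.drop_duplicates(subset=['Name'], keep='last')

print(full)
print(by_name_last)
```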


---
## Data Type Conversion:

- `astype()`: Convert data types of DataFrame columns.

In [13]:
df['Age'] = df['Age'].astype(str)
df.dtypes

Name    object
Age     object
dtype: object
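
Converting in the other direction, from text to numbers, fails with `astype()` if any value is not a valid number. One possible approach is `pd.to_numeric()` with `errors='coerce'`, which turns unparseable entries into `NaN`; the sample values below are made up.

```python
import pandas as pd

# Ages stored as text, with one entry that is not a number
ages = pd.Series(['25', '30', 'unknown', '22'])

# Invalid entries become NaN instead of raising an error
numeric_ages = pd.to_numeric(ages, errors='coerce')

print(numeric_ages)
```

The resulting `NaN` values can then be handled with the missing-value methods described earlier.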

---

## Data Transformation | Handling Categorical Data:

- `map()`: Replace values based on a mapping dictionary.
- `get_dummies()`: Create dummy variables for categorical columns.
- `factorize()`: Encode categorical columns with numeric labels.

In [14]:
data = {'Grade': ['A', 'B', 'C', 'A', 'B', 'A']}
df = pd.DataFrame(data)

grade_mapping = {'A': 'Excellent', 'B': 'Good', 'C': 'Average'}
df['Grade'] = df['Grade'].map(grade_mapping)
df

Unnamed: 0,Grade
0,Excellent
1,Good
2,Average
3,Excellent
4,Good
5,Excellent


In [21]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Gender': ['Female', 'Male', 'Male', 'Male']}
df = pd.DataFrame(data)

# dtype=int gives 0/1 columns; recent pandas versions default to True/False
df = pd.get_dummies(df, columns=['Gender'], dtype=int)
df


Unnamed: 0,Name,Gender_Female,Gender_Male
0,Alice,1,0
1,Bob,0,1
2,Charlie,0,1
3,David,0,1


In [32]:
data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C']}
df = pd.DataFrame(data)

df['Category'] = df['Category'].factorize()[0]
df

Unnamed: 0,Category
0,0
1,1
2,2
3,0
4,1
5,2
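
The cell above keeps only the numeric codes. `factorize()` also returns the unique labels, which makes it possible to map the codes back to the original values, as this small sketch shows.

```python
import pandas as pd

categories = pd.Series(['A', 'B', 'C', 'A', 'B', 'C'])

# codes: integer label for each element; uniques: labels in order of first appearance
codes, uniques = pd.factorize(categories)

# Recover the original labels from the codes
restored = uniques.take(codes)

print(codes)
print(list(uniques))
print(list(restored))
```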


## Handling Text Data :

- `str.strip(), str.lstrip(), str.rstrip()`: Remove leading and/or trailing whitespace from string columns.
- `replace()`: Replace whole values that match exactly; use `str.replace()` to replace substrings inside each string.
- `str.lower() / str.upper()`: Convert text to lowercase or uppercase.
- `str.extract()`: Extract specific patterns from text columns using regular expressions.

In [16]:
data = {'Name': ['   Alice  ', '   Bob   ', 'David     '],
        'Age': [25, 30, 22]}
df = pd.DataFrame(data)

df['Name'] = df['Name'].str.strip()
df

Unnamed: 0,Name,Age
0,Alice,25
1,Bob,30
2,David,22


In [17]:
df['Name'] = df['Name'].replace('Alice', 'Alex')
df

Unnamed: 0,Name,Age
0,Alex,25
1,Bob,30
2,David,22
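
The difference between replacing whole values and replacing substrings is easy to miss. In this sketch (the names are invented for illustration), `replace()` leaves the data unchanged because no element is exactly `'Smith'`, while `str.replace()` changes every occurrence inside each string.

```python
import pandas as pd

names = pd.Series(['Alice Smith', 'Bob Smith', 'Smithson'])

# Whole-value replacement: nothing equals 'Smith' exactly, so nothing changes
whole = names.replace('Smith', 'Jones')

# Substring replacement: every 'Smith' inside a string is replaced
partial = names.str.replace('Smith', 'Jones', regex=False)

print(whole.tolist())
print(partial.tolist())
```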


In [31]:
data = {'Name': ['Alice', 'Bob', 'Charlie']}
df = pd.DataFrame(data)

df['Name'] = df['Name'].str.lower()
df

Unnamed: 0,Name
0,alice
1,bob
2,charlie


In [47]:
data = {'Description': ['Product A is great', 'Product B is awesome', 'Product C is amazing']}
df = pd.DataFrame(data)

df['Product'] = df['Description'].str.extract(r'Product ([A-Z])')
df


Unnamed: 0,Description,Product
0,Product A is great,A
1,Product B is awesome,B
2,Product C is amazing,C


---
## Data Filtering:

- Use boolean indexing to filter rows based on certain conditions.

In [19]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 22]}
df = pd.DataFrame(data)

filtered_df = df[df['Age'] > 25]
filtered_df

Unnamed: 0,Name,Age
1,Bob,30
2,Charlie,35


## Renaming Columns:

- `rename()`: Rename columns in the DataFrame.

In [22]:
df.rename(columns={'Name': 'Full Name'}, inplace=True)
df

Unnamed: 0,Full Name,Age
0,Alice,25
1,Bob,30
2,Charlie,35
3,David,22


## Dropping Columns:

- `drop()`: Remove specific columns from the DataFrame.

In [24]:
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 22]}
df = pd.DataFrame(data)

df.drop(columns=['Age'], inplace=True)
df

Unnamed: 0,Name
0,Alice
1,Bob
2,Charlie
3,David


## Handling Datetime Data:
- `to_datetime()`: Convert columns to datetime objects.
- `dt`: Access various components of the datetime column (e.g., year, month, day).

In [26]:
data = {'Date': ['2023-07-15', '2023-07-16', '2023-07-17'],
        'Temperature': [25, 28, 30]}
df = pd.DataFrame(data)

df['Date'] = pd.to_datetime(df['Date'])
print(df.dtypes)

Date           datetime64[ns]
Temperature             int64
dtype: object


In [28]:
df['Year'] = df['Date'].dt.year
df

Unnamed: 0,Date,Temperature,Year
0,2023-07-15,25,2023
1,2023-07-16,28,2023
2,2023-07-17,30,2023
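
The `dt` accessor exposes more than the year. As a brief sketch on the same three dates, the example below extracts the month and the weekday name, and formats the dates as text; the `%d/%m/%Y` format is just one possible choice.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2023-07-15', '2023-07-16', '2023-07-17']))

months = dates.dt.month                    # month number (1 to 12)
day_names = dates.dt.day_name()            # weekday name
formatted = dates.dt.strftime('%d/%m/%Y')  # dates rendered as strings

print(months.tolist())
print(day_names.tolist())
print(formatted.tolist())
```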


---
## Handling Numeric Data:
- `round()`: Round numeric values to a specified number of decimal places.
- `clip()`: Limit numeric values within a specific range.

In [29]:
data = {'Value': [3.1456, 6.789, 2.345]}
df = pd.DataFrame(data)

df['Value'] = df['Value'].round(2)
df

Unnamed: 0,Value
0,3.15
1,6.79
2,2.35


In [30]:
df['Value'] = df['Value'].clip(lower=4, upper=6)
df

Unnamed: 0,Value
0,4.0
1,6.0
2,4.0


---
## SORTING data:
- `sort_values()` is used to sort the DataFrame based on the values of one or more columns. By default, it sorts the DataFrame in ascending order, but you can specify `ascending=False` to sort in descending order.
- `sort_index()` is used to sort the DataFrame based on its index (row labels).




In [1]:
import pandas as pd

# Create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 20, 30, 22],
    'Salary': [50000, 40000, 60000, 45000]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
df

Original DataFrame:


Unnamed: 0,Name,Age,Salary
0,Alice,25,50000
1,Bob,20,40000
2,Charlie,30,60000
3,David,22,45000


In [2]:
# Sort the DataFrame by 'Age' in ascending order
df_sorted_age = df.sort_values(by='Age')

print("\nDataFrame sorted by Age (ascending):")
df_sorted_age


DataFrame sorted by Age (ascending):


Unnamed: 0,Name,Age,Salary
1,Bob,20,40000
3,David,22,45000
0,Alice,25,50000
2,Charlie,30,60000


In [3]:
# Sort the DataFrame by 'Salary' in descending order
df_sorted_salary = df.sort_values(by='Salary', ascending=False)

print("\nDataFrame sorted by Salary (descending):")
df_sorted_salary


DataFrame sorted by Salary (descending):


Unnamed: 0,Name,Age,Salary
2,Charlie,30,60000
0,Alice,25,50000
3,David,22,45000
1,Bob,20,40000
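
`sort_values()` also accepts a list of columns, with a separate sort direction for each. In the sketch below, the `Dept` column is a hypothetical addition used only to show sorting by one column and breaking ties with another.

```python
import pandas as pd

df = pd.DataFrame({
    'Dept': ['Sales', 'IT', 'Sales', 'IT'],  # hypothetical column for illustration
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Salary': [50000, 40000, 60000, 45000],
})

# Sort by department A to Z, then by salary from highest to lowest within each department
ranked = df.sort_values(by=['Dept', 'Salary'], ascending=[True, False])

# reset_index(drop=True) renumbers the rows after sorting
ranked = ranked.reset_index(drop=True)

print(ranked)
```

Note that `sort_values()` returns a new DataFrame and leaves the original unchanged unless the result is assigned back.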


In [4]:
import pandas as pd

# Create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 20, 30],
    'Salary': [50000, 40000, 60000]
}
df = pd.DataFrame(data)

# Change the index of the DataFrame
df.index = ['c', 'a', 'b']

print("Original DataFrame:")
print(df)

# Sort the DataFrame by index in ascending order
df_sorted_index = df.sort_index()

print("\nDataFrame sorted by index (ascending):")
print(df_sorted_index)

# Sort the DataFrame by index in descending order
df_sorted_index_desc = df.sort_index(ascending=False)

print("\nDataFrame sorted by index (descending):")
print(df_sorted_index_desc)


Original DataFrame:
      Name  Age  Salary
c    Alice   25   50000
a      Bob   20   40000
b  Charlie   30   60000

DataFrame sorted by index (ascending):
      Name  Age  Salary
a      Bob   20   40000
b  Charlie   30   60000
c    Alice   25   50000

DataFrame sorted by index (descending):
      Name  Age  Salary
c    Alice   25   50000
b  Charlie   30   60000
a      Bob   20   40000


## Unique Values :
- `nunique()` counts the number of distinct values. On a Series it returns a single number; on a DataFrame it returns the count of distinct values in each column. Missing values are excluded by default.
- `value_counts()` counts how often each unique value occurs in a Series, sorted from most to least frequent. It is a convenient way to get a frequency distribution.

In [6]:
data = [1, 2, 3, 2, 1, 4, 3, 5, 5]
my_series = pd.Series(data)

# Count the number of unique elements in the Series
num_unique_elements = my_series.nunique()
num_unique_elements

5

In [9]:
data = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple', 'banana', 'kiwi', 'kiwi']
fruits_series = pd.Series(data)

# Get the frequency count of each unique value
fruits_series.value_counts()

apple     3
banana    3
kiwi      2
orange    1
dtype: int64
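
Both methods have variations worth knowing. As a small sketch on made-up data, the example below calls `nunique()` on a whole DataFrame to get one count per column, and uses `value_counts(normalize=True)` to get proportions instead of raw counts.

```python
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['apple', 'banana', 'apple', 'kiwi'],
    'Color': ['red', 'yellow', 'green', 'green'],
})

# One distinct-value count per column
per_column = df.nunique()

# Share of each value instead of its raw count
shares = df['Fruit'].value_counts(normalize=True)

print(per_column)
print(shares)
```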