# **Introduction to Pandas** #
**Pandas** is an open-source Python library primarily used for **data manipulation and analysis**. It provides data structures and functions that make it easy to work with structured data, particularly **data in the form of tables**.  
Pandas is widely used in data science, machine learning, and data analysis because of its powerful and easy-to-use data structures, Series and DataFrame.


### **Key Data Structures in Pandas:** ###

1. **Series:** A one-dimensional array-like object containing a sequence of values.
Each value is associated with an index label.
2. **DataFrame:** A two-dimensional labeled data structure with columns that can hold different data types.
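
To make the two structures concrete, here is a minimal sketch. The labels, names, and values are invented for illustration only.

```python
import pandas as pd

# A Series pairs each value with an index label (labels here are illustrative)
temperatures = pd.Series([21.5, 19.0, 23.2], index=["Mon", "Tue", "Wed"])
print(temperatures["Tue"])  # look up a value by its label

# A DataFrame holds columns of different types side by side
people = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [24, 27],
    "Height_m": [1.68, 1.80],
})
print(people.dtypes)  # one dtype per column
```

Each column of a DataFrame is itself a Series, so `people["Age"]` can be used like `temperatures`.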

**Common Data Analysis Tasks with Pandas:**

1. **Data Import/Export:** Reading data from various file formats (CSV, Excel, JSON, etc.) and writing data to different formats.
2. **Data Cleaning:** Handling missing values, removing duplicates, and correcting inconsistencies.
3. **Data Manipulation:** Filtering, sorting, grouping, and transforming data.
4. **Data Analysis:** Statistical calculations, time series analysis, and exploratory data analysis.
5. **Data Visualization:** Plotting data directly from Series and DataFrames, or passing them to libraries like Matplotlib and Seaborn.

**Installing Pandas:**
To use pandas, you need to install it using pip:



```
pip install pandas
```



## **Basic Examples** ##
1. **Importing Pandas**

In [1]:
import pandas as pd

2. **Creating a Series**

In [2]:
# Creating a simple Series
data = pd.Series([1, 2, 3, 4, 5])
print(data)

0    1
1    2
2    3
3    4
4    5
dtype: int64


3. **Creating a DataFrame**
The DataFrame is like a table with rows and columns.

In [3]:
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 27, 22],
    'City': ['New York', 'San Francisco', 'Los Angeles']
}
df = pd.DataFrame(data)
print(df)

      Name  Age           City
0    Alice   24       New York
1      Bob   27  San Francisco
2  Charlie   22    Los Angeles


4. **Reading a CSV File**

In [4]:
df = pd.read_csv("data.csv")
print(df.head())  # Display the first few rows of the DataFrame


FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'
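
The `FileNotFoundError` above only means that `data.csv` is not in the working directory. To try `read_csv` without a file on disk, one option is to read from an in-memory string, as in this small sketch. The CSV text is made up to mirror the earlier DataFrame, and `people_df` is an illustrative name.

```python
import io
import pandas as pd

# CSV content held in a string instead of a file (illustrative data)
csv_text = """Name,Age,City
Alice,24,New York
Bob,27,San Francisco
Charlie,22,Los Angeles
"""

# read_csv accepts any file-like object, not only a path
people_df = pd.read_csv(io.StringIO(csv_text))
print(people_df.head())  # head() shows the first 5 rows by default
```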

5. **Data Filtering and Manipulation**

In [None]:
# Selecting rows where age > 23
filtered_df = df[df['Age'] > 23]
print(filtered_df)
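
Filters can also be combined and the result sorted. The sketch below rebuilds the small example DataFrame so that it runs on its own. The variable names `older_not_ny` and `by_age` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [24, 27, 22],
    "City": ["New York", "San Francisco", "Los Angeles"],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
older_not_ny = df[(df["Age"] > 23) & (df["City"] != "New York")]

# Sort the filtered rows by age, oldest first
by_age = df[df["Age"] > 23].sort_values("Age", ascending=False)

print(older_not_ny)
print(by_age)
```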


## READING OUR TEST FILE ##

In [None]:
import pandas as pd

# Read the CSV file into a Pandas DataFrame
file_name = "Dataset_Lab_Test.csv"  # File name
data = pd.read_csv(file_name)

# Display the first few rows of the DataFrame
print(data.head())


If the dataset has no header row, pass `header=None` and supply the column names yourself with `names`:

```
data = pd.read_csv(file_name, header=None, names=['Year', 'Month', 'Region', 'Product_A', 'Product_B', 'Product_C'])
```
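
A quick way to see the effect is to parse a couple of headerless rows from a string. The two rows below are invented purely to show the mechanics and are not taken from `Dataset_Lab_Test.csv`.

```python
import io
import pandas as pd

# Two made-up rows with no header line (values are illustrative only)
raw = "2021,1,North,10,20,30\n2021,2,South,15,25,35\n"
columns = ["Year", "Month", "Region", "Product_A", "Product_B", "Product_C"]

# Without header=None, the first data row would be consumed as column names
sales = pd.read_csv(io.StringIO(raw), header=None, names=columns)
print(sales)
```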

### Calculate MEAN, STANDARD DEVIATION, VARIANCE, MEDIAN, MODE ###

In [None]:
# Calculate statistics for Product_A
average = data['Product_A'].mean()
std_dev = data['Product_A'].std()
variance = data['Product_A'].var()
median = data['Product_A'].median()
mode = data['Product_A'].mode()

# Display the results
print(f"Average (Mean) of Product_A: {average}")
print(f"Standard Deviation of Product_A: {std_dev}")
print(f"Variance of Product_A: {variance}")
print(f"Median of Product_A: {median}")
print(f"Mode of Product_A: {mode.tolist()}")
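
One detail worth knowing: `std()` and `var()` compute the sample statistics by default (they divide by n - 1). The sketch below uses a made-up Series to show the difference from the population versions, and why `mode()` returns a Series rather than a single number.

```python
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # made-up sample

sample_std = values.std()            # divides by n - 1 (ddof=1, the default)
population_std = values.std(ddof=0)  # divides by n
modes = values.mode()                # a Series, because ties are possible

print(f"Sample std: {sample_std}")
print(f"Population std: {population_std}")
print(f"Mode(s): {modes.tolist()}")
```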

### **Calculate MEAN grouped by a column (Year or Region)** ###


In [None]:
# Group by 'Year' and calculate the average of 'Product_A'
average_product_a_by_year = data.groupby('Year')['Product_A'].mean()

# Display the result
print(average_product_a_by_year)

In [None]:
# Group by 'Region' and calculate the mean for each product
mean_by_region = data.groupby('Region')[['Product_A', 'Product_B', 'Product_C']].mean()

# Display the result
print(mean_by_region)
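
If you need more than the mean for each group, `agg` accepts a list of statistics. This is a small sketch on invented data that reuses the column names from this notebook.

```python
import pandas as pd

sales = pd.DataFrame({
    "Region": ["North", "North", "South", "South"],
    "Product_A": [10, 20, 30, 50],
})

# One row per region, one column per requested statistic
summary = sales.groupby("Region")["Product_A"].agg(["mean", "min", "max"])
print(summary)
```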

## DATA CLEANING WITH PANDAS ##

Data cleaning is a crucial part of data preprocessing to ensure your dataset is free from errors, inconsistencies, or irrelevant data. Pandas provides powerful tools for cleaning and preparing data for analysis. Below are common data cleaning tasks and how to perform them with Pandas:

### **1. Handling Missing Data** ###

**Check for missing values:**

In [None]:
# Check for missing values in each column
print(data.isnull().sum())

# Display rows with missing data
print(data[data.isnull().any(axis=1)])


**Drop rows or columns with missing values:**

In [None]:
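# These two calls are alternatives: normally you pick one of them.
# If both run, the second finds nothing left to drop.
data = data.dropna()  # Removes rows with any missing values
data = data.dropna(axis=1)  # Removes columns with any missing values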
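
Dropping every row with any missing value can discard a lot of data. `dropna` has parameters for being more selective, sketched here on an invented frame. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "Year": [2021, 2022, 2023],
    "Product_A": [10.0, np.nan, 30.0],
    "Product_B": [np.nan, np.nan, 5.0],
})

# Drop a row only if Product_A is missing
need_a = frame.dropna(subset=["Product_A"])

# Keep rows that have at least 2 non-missing values
at_least_two = frame.dropna(thresh=2)

# Drop rows only when every value is missing (none here)
not_all_missing = frame.dropna(how="all")

print(len(need_a), len(at_least_two), len(not_all_missing))
```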

**Fill missing values:**

In [None]:
# Fill missing values with a specific value
data['Column_Name'] = data['Column_Name'].fillna(0)

# Fill with the mean, median, or mode (choose one: once the column is filled,
# the later calls have nothing left to fill)
data['Column_Name'] = data['Column_Name'].fillna(data['Column_Name'].mean())
data['Column_Name'] = data['Column_Name'].fillna(data['Column_Name'].median())
data['Column_Name'] = data['Column_Name'].fillna(data['Column_Name'].mode()[0])

# Forward/Backward fill (fillna(method=...) is deprecated in recent pandas versions)
data['Column_Name'] = data['Column_Name'].ffill()  # Forward fill
data['Column_Name'] = data['Column_Name'].bfill()  # Backward fill
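
To see what forward and backward filling actually do, here is a tiny sketch on a made-up Series, using the `ffill()` and `bfill()` methods.

```python
import numpy as np
import pandas as pd

readings = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = readings.ffill()   # carry the last seen value forward
backward = readings.bfill()  # pull the next seen value backward

print(forward.tolist())
print(backward.tolist())
```

Note that a backward fill cannot fill a trailing gap (and a forward fill cannot fill a leading one), because there is no value on that side to copy from.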


### **2. Removing Duplicate Rows** ###

**Find duplicates:**

In [None]:
print(data.duplicated())

**Remove duplicates:**

In [None]:
data = data.drop_duplicates()
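
By default, a row counts as a duplicate only when every column matches an earlier row. The `subset` and `keep` parameters let you change that, as in this sketch on invented data.

```python
import pandas as pd

rows = pd.DataFrame({
    "Year": [2021, 2021, 2022],
    "Region": ["North", "North", "South"],
    "Product_A": [10, 12, 30],
})

# No fully identical rows, so nothing is flagged by default
print(rows.duplicated().sum())

# Treat rows as duplicates when Year and Region match; keep the last one
deduped = rows.drop_duplicates(subset=["Year", "Region"], keep="last")
print(deduped)
```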

### **3. Removing Outliers** ###



In [None]:
# Identify outliers using the IQR method
Q1 = data['Product_C'].quantile(0.25)
Q3 = data['Product_C'].quantile(0.75)
IQR = Q3 - Q1

# Define the lower and upper bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out rows with outliers
data = data[(data['Product_C'] >= lower_bound) & (data['Product_C'] <= upper_bound)]
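
It is often worth looking at the outliers before discarding them. The following sketch applies the same IQR rule to a small made-up column and keeps the rejected rows in a separate variable. The data and the names `outliers` and `cleaned` are illustrative.

```python
import pandas as pd

# 100 is a planted outlier in otherwise similar values
sales = pd.DataFrame({"Product_C": [10, 12, 11, 13, 12, 100]})

q1 = sales["Product_C"].quantile(0.25)
q3 = sales["Product_C"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# between() is inclusive on both ends by default
in_range = sales["Product_C"].between(lower_bound, upper_bound)
outliers = sales[~in_range]  # rows to inspect
cleaned = sales[in_range]    # rows to keep

print(outliers)
```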


### **4. Convert Data Types** ###



In [None]:
# Convert Year to integers (this raises an error if the column still has missing values)
data['Year'] = data['Year'].astype(int)
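
When a column contains text that cannot be parsed or has missing entries, a plain `astype(int)` fails. One possible approach, sketched on a made-up Series, is to coerce bad values to missing and then use the nullable `Int64` dtype.

```python
import pandas as pd

raw_years = pd.Series(["2021", "2022", "unknown", None])  # illustrative messy input

# Unparseable entries become missing values instead of raising an error
years = pd.to_numeric(raw_years, errors="coerce")

# The nullable Int64 dtype can hold integers alongside missing values
years_int = years.astype("Int64")
print(years_int)
```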

### **5. Normalize Product A** ###



In [None]:
# Min-max normalize Product_A so its values fall in the range [0, 1]
data['Normalized_Product_A'] = (data['Product_A'] - data['Product_A'].min()) / (data['Product_A'].max() - data['Product_A'].min())
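
Here is the same min-max formula worked through on three made-up values, so the result is easy to check by hand.

```python
import pandas as pd

product_a = pd.Series([10, 20, 40])  # illustrative values

value_range = product_a.max() - product_a.min()  # 40 - 10 = 30
normalized = (product_a - product_a.min()) / value_range

print(normalized.tolist())  # smallest value maps to 0, largest to 1
```

If every value in the column is identical, the range is zero and the result is all NaN, so it may be worth checking for that case first.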


### **6. Filter Invalid Inputs** ###



In [None]:
# Remove rows where a column value does not meet a condition
data = data[data['Year'] > 2021]  # Keep rows where Year is greater than 2021
data = data[data['Region'].isin(['North', 'South', 'East', 'West'])]  # Keep rows with specific categories
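
The two conditions can also be combined into a single boolean mask, which makes it easy to keep the rejected rows for review. This is a sketch on invented records. The names `is_valid`, `valid`, and `rejected` are illustrative.

```python
import pandas as pd

records = pd.DataFrame({
    "Year": [2020, 2022, 2023, 2023],
    "Region": ["North", "South", "Centre", "West"],
})
valid_regions = ["North", "South", "East", "West"]

# True only where both conditions hold
is_valid = (records["Year"] > 2021) & records["Region"].isin(valid_regions)

valid = records[is_valid]      # rows that pass both checks
rejected = records[~is_valid]  # rows that fail at least one check

print(valid)
print(rejected)
```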
