# What is pandas?

Pandas is an open-source data analysis and manipulation library for the Python programming language. It is built on top of NumPy, another popular library for numerical computing in Python. Pandas provides high-level data structures like DataFrames and Series that are designed to handle tabular and labeled data.

DataFrames are two-dimensional labeled tables with rows and columns, where each column can have a different data type. They are similar to spreadsheets or SQL tables. Series are one-dimensional labeled arrays that can hold any data type; each column of a DataFrame is a Series.
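As a minimal sketch of these two structures, the snippet below builds a Series and a DataFrame by hand. The item names, prices, and column names are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional values with labels (the labels form the index)
prices = pd.Series([1.50, 0.25, 3.00],
                   index=["apple", "banana", "cherry"],
                   name="price")

# A DataFrame: two-dimensional, and each column may have its own dtype
inventory = pd.DataFrame(
    {
        "item": ["apple", "banana", "cherry"],
        "price": [1.50, 0.25, 3.00],
        "in_stock": [True, False, True],
    }
)

# Selecting a single column of a DataFrame returns a Series
price_column = inventory["price"]

print(prices["banana"])             # look up a value by its label
print(inventory.dtypes)             # one dtype per column
print(type(price_column).__name__)  # Series
```

Note that `prices` is indexed by item name, while `inventory` falls back to the default integer index because none was given.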

Pandas provides a wide range of functions for data manipulation, including filtering, sorting, grouping, and merging. It can handle missing data, time series data, and categorical data. Pandas is widely used in data science, finance, economics, and other fields where data analysis and manipulation are critical.
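The operations named above can be sketched on a small in-memory table. The `sales` data and its column names are hypothetical, and filling missing values with 0 is only one of several possible strategies.

```python
import pandas as pd

# Hypothetical sales records; None marks a missing value
sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south", "east"],
        "units": [10, 3, 7, None, 5],
    }
)

# Missing data: fill the gap with 0
sales["units"] = sales["units"].fillna(0)

# Filtering: keep rows with at least 5 units
large = sales[sales["units"] >= 5]

# Sorting: highest unit counts first
ranked = sales.sort_values("units", ascending=False)

# Grouping: total units per region
totals = sales.groupby("region")["units"].sum()

print(large)
print(ranked)
print(totals)
```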



# Examples

Here are five examples of how Pandas can be used for data analysis and manipulation:

 1. **Data cleaning:** Pandas can be used to clean and preprocess data before analysis. For example, you can use Pandas to remove missing values, replace incorrect values, or convert data types.



```python
import pandas as pd

# Load data into a DataFrame
df = pd.read_csv('my_data.csv')

# Treat the '?' placeholder as a missing value
df.replace('?', pd.NA, inplace=True)

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Remove irrelevant columns
df.drop(['column1', 'column2'], axis=1, inplace=True)

# Convert data types (errors='coerce' turns unparseable values into NaN instead of raising)
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')

# Rename columns
df.rename(columns={'old_name': 'new_name'}, inplace=True)

# Export cleaned data to a new file
df.to_csv('cleaned_data.csv', index=False)

```


This code reads a CSV file, marks `'?'` placeholders as missing, removes duplicate rows and irrelevant columns, converts a column to a numeric type, renames a column, and exports the cleaned data to a new file.

The specific cleaning steps depend on the data you are working with (the file and column names above are placeholders), but this code should give you an idea of how Pandas can be used for data cleaning.
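Because `my_data.csv` is only a placeholder, here is a self-contained sketch of the same cleaning ideas on a tiny in-memory table. The `sensor` and `reading` columns and their values are invented for illustration, and dropping incomplete rows is just one possible choice.

```python
import pandas as pd

# Hypothetical raw data: '?' marks a missing reading and one row is duplicated
raw = pd.DataFrame(
    {
        "sensor": ["a", "b", "b", "c"],
        "reading": ["1.5", "?", "?", "oops"],
    }
)

# Drop exact duplicate rows (the second 'b' row)
clean = raw.drop_duplicates().copy()

# errors='coerce' turns anything unparseable ('?' and 'oops') into NaN
clean["reading"] = pd.to_numeric(clean["reading"], errors="coerce")

# Count what is missing before deciding whether to drop or fill it
n_missing = clean["reading"].isna().sum()

# Here we simply drop rows without a usable reading
clean = clean.dropna(subset=["reading"])

print(n_missing)
print(clean)
```

Counting missing values before dropping them makes it easier to judge how much data a cleaning step throws away.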

2. **Data exploration:** Pandas can help you explore and understand your data. You can use Pandas to generate descriptive statistics, visualize data using charts and plots, or identify patterns and trends in your data.



```python
import pandas as pd
import matplotlib.pyplot as plt

# Load data into a DataFrame
df = pd.read_csv('my_data.csv')

# Generate descriptive statistics
print(df.describe())

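# Calculate correlations between numeric variables
# (numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.x)
print(df.corr(numeric_only=True))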

# Visualize data using histograms
df.hist(bins=10, figsize=(10,8))
plt.show()

# Visualize data using scatter plots
df.plot(kind='scatter', x='column1', y='column2', alpha=0.5, color='blue')
plt.show()

```

This code reads in a CSV file, generates descriptive statistics, calculates correlations between variables, and visualizes the data using histograms and scatter plots.

The specific exploration steps depend on the data you are working with, but this code should give you an idea of how Pandas can be used for data exploration. You can also explore data with line charts, bar charts, or box plots: the `plot` method on DataFrames and Series, which draws with Matplotlib, supports these through its `kind` argument (for example `kind='line'`, `kind='bar'`, or `kind='box'`).


3. **Data transformation:** Pandas can transform your data to prepare it for analysis. For example, you can use Pandas to group data by certain criteria, build pivot tables, or reshape data to fit different analytical needs.



```python
import pandas as pd

# Load data into a DataFrame
df = pd.read_csv('my_data.csv')

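# Group data by a categorical variable and average the numeric columns
# (numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.x)
grouped = df.groupby('category_column')
grouped_mean = grouped.mean(numeric_only=True)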

# Pivot the data to create a new DataFrame
# (pivot requires each column1/column2 pair to be unique; use pivot_table to aggregate duplicates)
pivoted = df.pivot(index='column1', columns='column2', values='numeric_column')

# Reshape the data using melt function
melted = pd.melt(df, id_vars=['column1'], value_vars=['column2', 'column3'])

# Merge two DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value2': [5, 6, 7, 8]})
merged = pd.merge(df1, df2, on='key')

# Export transformed data to new files
# (keep the index for the grouped and pivoted results, since it holds their row labels)
grouped_mean.to_csv('grouped_mean.csv')
pivoted.to_csv('pivoted.csv')
melted.to_csv('melted.csv', index=False)
merged.to_csv('merged.csv', index=False)

```

This code reads in a CSV file, performs several data transformations like grouping, pivoting, reshaping, and merging, and exports the transformed data to new files.

Of course, the specific data transformation steps will depend on the data you are working with, but this code should give you an idea of how Pandas can be used for data transformation. Pandas provides many built-in functions and methods that you can use to transform your data in a variety of ways.
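Two details of merging and pivoting are easy to miss, so here is a small sketch of both. It reuses the `df1` and `df2` frames from the example above; the `long` table and its `day`, `city`, and `temp` columns are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value1": [1, 2, 3, 4]})
df2 = pd.DataFrame({"key": ["B", "D", "E", "F"], "value2": [5, 6, 7, 8]})

# Default (how='inner'): only keys present in both frames survive
inner = pd.merge(df1, df2, on="key")

# how='outer': keep every key from either frame; gaps become NaN
outer = pd.merge(df1, df2, on="key", how="outer")

# how='left': keep every row of df1, matched or not
left = pd.merge(df1, df2, on="key", how="left")

# Hypothetical long-format data in which the pair (mon, x) appears twice
long = pd.DataFrame(
    {
        "day": ["mon", "mon", "mon", "tue"],
        "city": ["x", "x", "y", "x"],
        "temp": [10.0, 12.0, 20.0, 14.0],
    }
)

# DataFrame.pivot would raise on the duplicated pair;
# pivot_table aggregates the duplicates instead (here with the mean)
wide = long.pivot_table(index="day", columns="city", values="temp", aggfunc="mean")

print(inner)
print(outer)
print(left)
print(wide)
```

The choice of `how` decides which unmatched rows are kept, so it is worth checking the row count of a merged frame against expectations.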

4. **Data analysis:** Pandas can be used for a wide range of data analysis tasks. For example, you can compute summary statistics directly in Pandas, and pass Pandas data to libraries such as SciPy and scikit-learn to run hypothesis tests or build predictive models.



```python
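import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats                            # provides ttest_ind
from sklearn.linear_model import LinearRegression  # provides the regression model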

# Load data into a DataFrame
df = pd.read_csv('my_data.csv')

# Calculate summary statistics
mean = df['numeric_column'].mean()
median = df['numeric_column'].median()
mode = df['categorical_column'].mode()[0]

# Perform hypothesis testing (independent two-sample t-test, ignoring missing values)
t_stat, p_value = stats.ttest_ind(df['column1'], df['column2'], nan_policy='omit')

# Build a predictive model
X = df[['feature1', 'feature2', 'feature3']]
y = df['target']
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)

# Visualize data using charts and plots
sns.barplot(x='category_column', y='numeric_column', data=df)
plt.show()

sns.scatterplot(x='feature1', y='target', data=df)
plt.show()

# Export analysis results to a new file
summary_stats = pd.DataFrame({'mean': [mean], 'median': [median], 'mode': [mode]})
summary_stats.to_csv('summary_stats.csv', index=False)

hypothesis_test = pd.DataFrame({'t_stat': [t_stat], 'p_value': [p_value]})
hypothesis_test.to_csv('hypothesis_test.csv', index=False)

predictions_df = pd.DataFrame({'predictions': predictions})
predictions_df.to_csv('predictions.csv', index=False)

```

This code reads a CSV file, calculates summary statistics, runs a two-sample t-test with SciPy, fits a linear regression model with scikit-learn, visualizes the data with Seaborn, and exports the results to new files.

Of course, the specific data analysis tasks will depend on the data you are working with, but this code should give you an idea of how Pandas can be used for data analysis. Pandas provides many built-in functions and methods that you can use to perform various data analysis tasks, and you can also use external libraries like NumPy, SciPy, and scikit-learn for more advanced analysis tasks.


5. **Data visualization:** Pandas can help you visualize your data using a variety of charts and plots. For example, you can create bar charts, line charts, and scatter plots with Pandas' Matplotlib-based `plot` method, or pass a DataFrame to Seaborn to draw heatmaps, to explore and communicate your data.



```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data into a DataFrame
df = pd.read_csv('my_data.csv')

# Visualize data using line chart
df.plot(x='date_column', y='numeric_column', kind='line')
plt.show()

# Visualize data using bar chart
df['categorical_column'].value_counts().plot(kind='bar')
plt.show()

# Visualize data using scatter plot
sns.scatterplot(x='feature1', y='target', data=df)
plt.show()

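# Visualize correlations between numeric columns using a heatmap
# (numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.x)
corr_matrix = df.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=True)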
plt.show()

```

This code reads in a CSV file, performs data visualization tasks using Pandas and Seaborn libraries, and displays the visualizations using Matplotlib.

Of course, the specific data visualization tasks will depend on the data you are working with, but this code should give you an idea of how Pandas can be used for data visualization. Pandas provides many built-in functions and methods that you can use to create various types of charts and plots, and you can also use external libraries like Matplotlib and Seaborn for more advanced visualization tasks.



# Exercises with the cars dataset

In [None]:
# 1. Load the cars dataset into a pandas DataFrame and display the first 10 rows. 

In [None]:
# 2. Create a new DataFrame that includes only the rows where the car's make is "Ford" and the model year is between 1970 and 1980.

In [None]:
# 3. Calculate the average miles per gallon (mpg) for each car make in the dataset.

In [None]:
# 4. Sort the cars dataset by horsepower (descending) and display the top 5 rows.

In [None]:
# 5. Create a scatter plot of horsepower (hp) versus weight (lbs) for all cars in the dataset.

In [None]:
# 6. Calculate the correlation between mpg and each of the other numeric columns in the dataset, and display the results in a heatmap.

In [None]:
# 7. Create a new DataFrame that includes only the rows where the car's make is "Chevrolet" or "Ford", and the horsepower is greater than the mean horsepower for all cars in the dataset.

In [None]:
# 8. Create a new column in the DataFrame that contains a string describing the car's fuel efficiency, based on its mpg value: "high" if mpg is greater than or equal to 30, "medium" if mpg is greater than or equal to 20 but less than 30, and "low" if mpg is less than 20.

# Exercises with the Titanic dataset

In [None]:
# 1. Load the Titanic dataset into a pandas DataFrame and display the first 10 rows.

In [None]:
# 2. Calculate the percentage of passengers who survived, and the percentage who did not survive.

In [None]:
# 3. Create a bar chart showing the number of passengers in each passenger class (1st, 2nd, 3rd).

In [None]:
# 4. Calculate the average age of male and female passengers separately.

In [None]:
# 5. Create a new DataFrame that includes only the rows where the passenger's age is missing (i.e. NaN).