# Exploratory Data Analysis and Data Preprocessing #
You will read in one of the 2024 Tidy Tuesday datasets from https://github.com/rfordatascience/tidytuesday/tree/main/data/2024 and use it for exploratory data analysis and preprocessing.

## Part 1: Exploratory Data Analysis ##

### Assignment 1: Data Overview (20 Points) ###
- Load the dataset into a Pandas DataFrame and display the first few rows.
- Provide a basic description of the dataset, including its shape, columns, and data types.

***Hint***
- Use functions like **.head()**, **.shape**, **.columns**, and **.dtypes** to get an overview of your **DataFrame**.
- Remember that **.info()** can be used to get a concise summary of the **DataFrame** including the non-null count and type of each column.

In [None]:
import pandas as pd


url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-02-13/historical_spending.csv"
df = pd.read_csv(url)

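# Assignment 1: overview of the DataFrame
print(df.head())    # first five rows
print(df.shape)     # (rows, columns)
print(df.columns)   # column labels
print(df.dtypes)    # data type of each column
df.info()           # prints its own summary and returns None, so print() is not needed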

### Assignment 2: Univariate Analysis (10 Points) ###
- For numerical features, calculate descriptive statistics and create histograms.
- For categorical features, count unique values and create bar plots.
  
***Hint***
- Use **.describe()** for a quick statistical summary of the numerical features.
- Utilize **matplotlib** or **seaborn** libraries to create histograms (**hist()** or **sns.histplot()**).
- For categorical data, **value_counts()** can help in understanding the distribution of classes, and you can plot the results using **bar()** or **sns.countplot()**.
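
The hints above can be sketched as follows. This is a minimal illustration on a small made-up DataFrame: the name `toy` and the columns `price` and `category` are assumptions for the example, not part of any Tidy Tuesday dataset, so substitute your own DataFrame and column names.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data for illustration only; replace with your own DataFrame.
toy = pd.DataFrame({
    "price": [10.0, 12.5, 11.0, 30.0, 9.5, 12.0],
    "category": ["a", "b", "a", "c", "a", "b"],
})

# Numerical features: descriptive statistics.
numeric_cols = toy.select_dtypes(include="number").columns
summary = toy[numeric_cols].describe()
print(summary)

fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Numerical features: histogram of one column.
ax_hist.hist(toy["price"], bins=5)
ax_hist.set_xlabel("price")
ax_hist.set_ylabel("count")

# Categorical features: count unique values, then draw a bar plot.
counts = toy["category"].value_counts()
print(counts)
counts.plot(kind="bar", ax=ax_bar)
ax_bar.set_ylabel("count")

fig.tight_layout()
```

In a notebook the figure renders inline; in a plain script, call `plt.show()` at the end.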

### Assignment 3: Bivariate Analysis (10 Points) ###
- Choose three pairs of numerical variables and create scatter plots to explore their relationships.
- Create boxplots for one numerical variable grouped by a categorical variable.

***Hint*** 
- When creating scatter plots with **plt.scatter()** or **sns.scatterplot()**, it might be helpful to color points by a third categorical variable using the hue parameter in **Seaborn**.
- Use **sns.boxplot()** to create boxplots. Consider using the hue parameter if you have sub-categories within your categorical variable.
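
One way to approach the plots described above, using only **matplotlib** and **pandas** instead of **Seaborn**. The DataFrame `toy` and its columns `height`, `weight`, and `group` are made up for the example; looping over `groupby` is a stand-in here for Seaborn's hue parameter.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data for illustration only; replace with your own DataFrame.
toy = pd.DataFrame({
    "height": [150, 160, 170, 180, 155, 165, 175, 185],
    "weight": [50, 58, 69, 80, 52, 61, 72, 88],
    "group": ["a", "a", "a", "a", "b", "b", "b", "b"],
})

fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot: one scatter() call per category so each group gets its own color.
for name, part in toy.groupby("group"):
    ax_scatter.scatter(part["height"], part["weight"], label=name)
ax_scatter.set_xlabel("height")
ax_scatter.set_ylabel("weight")
ax_scatter.legend(title="group")

# Boxplot of one numerical variable grouped by a categorical variable.
toy.boxplot(column="weight", by="group", ax=ax_box)

# A correlation coefficient can back up the visual impression.
corr = toy["height"].corr(toy["weight"])
print(f"Pearson correlation: {corr:.3f}")
```

Repeat the scatter step for each of the three variable pairs you choose.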

### Assignment 4: Missing Data and Outliers (10 Points) ###
- Identify any missing values in the dataset.
- Detect outliers in the numerical features using an appropriate method (e.g., Z-score, IQR).

***Hint***
- The **.isnull()** method chained with **.sum()** can help identify missing values.
- For Z-scores, consider the **scipy.stats** module. Alternatively, use the interquartile range (IQR), the distance between the first and third quartiles of a feature; a common rule flags values that lie more than 1.5 × IQR below the first quartile or above the third.
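
A minimal sketch of both detection methods on made-up data (the DataFrame `toy` and the column `value` are assumptions for the example). The thresholds of 3 standard deviations and 1.5 × IQR are common conventions rather than requirements, and the Z-score is computed by hand with `ddof=0`, which matches the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only: 15 values near 10, one extreme value, one gap.
toy = pd.DataFrame({
    "value": [10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 11.0, 9.0,
              10.0, 10.0, 11.0, 9.0, 10.0, 10.0, 60.0, np.nan],
})

# Missing values per column.
missing = toy.isnull().sum()
print(missing)

col = toy["value"].dropna()

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (col - col.mean()) / col.std(ddof=0)
z_outliers = col[z.abs() > 3]

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = col.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = col[(col < lower) | (col > upper)]

print("Z-score outliers:", z_outliers.tolist())
print("IQR outliers:", iqr_outliers.tolist())
```

Note that the Z-score rule needs enough rows to be useful: with very few observations no point can reach a Z-score of 3, whereas the IQR rule still works.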

## Part 2: Data Preprocessing ##

### Assignment 5: Handling Missing Values (20 Points) ###
- Choose appropriate methods to handle the missing data (e.g., imputation, removal).
 
***Hint***
- Imputation methods could involve using **.fillna()** with measures like mean (**data.mean()**) for numerical columns and mode (**data.mode().iloc[0]**) for categorical columns.
- For removal, **.dropna()** is straightforward but consider the impact on your dataset size.
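
The two strategies can be compared side by side. This sketch uses a made-up DataFrame (`toy`, with hypothetical columns `amount` and `kind`); which strategy is appropriate depends on how much data is missing and why.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; replace with your own DataFrame.
toy = pd.DataFrame({
    "amount": [1.0, 2.0, np.nan, 3.0],
    "kind": ["x", None, "x", "y"],
})

# Option 1: impute. Mean for the numerical column, mode for the categorical one.
imputed = toy.copy()
imputed["amount"] = imputed["amount"].fillna(imputed["amount"].mean())
imputed["kind"] = imputed["kind"].fillna(imputed["kind"].mode().iloc[0])

# Option 2: drop every row that has at least one missing value.
dropped = toy.dropna()

print(imputed)
print(f"rows kept after dropna: {len(dropped)} of {len(toy)}")
```

Here removal discards half of the rows, which shows why the effect on dataset size is worth checking before choosing **.dropna()**.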

### Assignment 6: Dealing with Outliers (10 Points) ###
- Treat or remove the outliers identified earlier based on your chosen methodology.

***Hint*** 
- For outlier removal, you may use boolean indexing based on Z-scores or IQR to filter your data.
- If you don’t want to remove outliers, consider reducing their influence by transforming the feature, for example with a log transformation (which requires positive values).
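
A sketch of the options above on made-up data (the DataFrame `toy` and column `value` are assumptions). Capping values at the IQR fences is included as a third option; it is not named in the hints and is shown only as one possible way to treat outliers without removing rows.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; replace with your own DataFrame.
toy = pd.DataFrame({"value": [8.0, 9.0, 10.0, 10.0, 11.0, 12.0, 60.0]})

# IQR fences, using the conventional 1.5 * IQR multiplier.
q1, q3 = toy["value"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: remove outliers with boolean indexing.
inside = toy["value"].between(lower, upper)
filtered = toy[inside]

# Option 2 (not from the hints): keep every row but cap values at the fences.
capped = toy.assign(value=toy["value"].clip(lower, upper))

# Option 3: keep every row and compress the scale with log(1 + x).
logged = toy.assign(value_log=np.log1p(toy["value"]))

print(f"rows before: {len(toy)}, after filtering: {len(filtered)}")
```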

### Assignment 7: Feature Engineering (10 Points) ###
- Create at least one new feature that could be useful for a data mining task.

***Hint*** 
- Think about the domain knowledge related to your dataset that could suggest new features. For instance, if you have date-time information, extracting the day of the week could be useful.
- Also, combining features, if relevant, to create ratios or differences can often reveal useful insights.
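
Both ideas can be illustrated briefly. The DataFrame `toy` and the columns `date`, `revenue`, and `cost` are made up for this sketch; the features that make sense for your dataset depend on its domain.

```python
import pandas as pd

# Toy data for illustration only; replace with your own DataFrame.
toy = pd.DataFrame({
    "date": ["2024-02-12", "2024-02-13", "2024-02-17"],
    "revenue": [100.0, 150.0, 90.0],
    "cost": [80.0, 90.0, 60.0],
})

# Date-time features: parse the strings, then extract the day of the week.
toy["date"] = pd.to_datetime(toy["date"])
toy["day_of_week"] = toy["date"].dt.day_name()
toy["is_weekend"] = toy["date"].dt.dayofweek >= 5  # Monday=0, ..., Sunday=6

# Combined features: a difference and a ratio of two existing columns.
toy["profit"] = toy["revenue"] - toy["cost"]
toy["cost_ratio"] = toy["cost"] / toy["revenue"]

print(toy)
```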

### Assignment 8: Data Transformation (10 Points) ###
- Standardize or normalize numerical features.
- Perform any additional transformations you deem necessary (e.g., encoding categorical variables, binning, etc.).

***Hint*** 
- For scaling, **StandardScaler** or **MinMaxScaler** from **sklearn.preprocessing** can be applied to numerical features.
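- To reduce skew, **np.log1p()** (log(1+x)) can help. Note that this is a nonlinear transformation rather than a normalization, and it requires values greater than -1.
- Use **pd.get_dummies()** or **OneHotEncoder** from **sklearn.preprocessing** for encoding categorical variables. **LabelEncoder** is intended for target labels rather than input features.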
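
A minimal sketch of these transformations on made-up data (the DataFrame `toy` and the columns `income` and `color` are assumptions for the example). It assumes **scikit-learn** is installed; the bin labels `low`, `mid`, and `high` are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data for illustration only; replace with your own DataFrame.
toy = pd.DataFrame({
    "income": [20.0, 30.0, 40.0, 50.0, 60.0],
    "color": ["red", "blue", "red", "green", "blue"],
})

# Standardize: zero mean and unit variance.
toy["income_std"] = StandardScaler().fit_transform(toy[["income"]]).ravel()

# Normalize: rescale to the range [0, 1].
toy["income_minmax"] = MinMaxScaler().fit_transform(toy[["income"]]).ravel()

# Log transform: log(1 + x), useful for right-skewed features.
toy["income_log"] = np.log1p(toy["income"])

# Binning: three equal-width bins with readable labels.
toy["income_bin"] = pd.cut(toy["income"], bins=3, labels=["low", "mid", "high"])

# One-hot encode the categorical column.
encoded = pd.get_dummies(toy, columns=["color"])

print(encoded.head())
```

When a model is evaluated on held-out data, fit the scaler on the training split only and reuse it to transform the test split, so that information from the test set does not leak into preprocessing.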