In [3]:
# Steps in Data Preprocessing

# 1. Data Collection: Gathering raw data from various sources.
# Task 1: Collect data from two different sources and merge them.
# Task 2: Validate the integrity of the collected datasets.
# Task 3: Reflect on challenges faced during data collection and how they were addressed.




In [4]:
import pandas as pd

# 1. Data Collection: Gathering raw data from various sources.

# Task 1: Collect data from two different sources and merge them.
print("\nTask 1: Collect data from two different sources and merge them.")

# Simulate data from Source A (e.g., a CSV file)
data_source_a = {'ID': [1, 2, 3, 4, 5],
                  'Product': ['Laptop', 'Tablet', 'Keyboard', 'Mouse', 'Monitor'],
                  'Price': [1200, 300, 75, 25, 250]}
df_a = pd.DataFrame(data_source_a)
print("\nData from Source A:")
print(df_a)

# Simulate data from Source B (e.g., an API response or another CSV)
data_source_b = {'ID': [3, 4, 5, 6, 7],
                  'Quantity_Sold': [100, 150, 200, 50, 120],
                  'Customer_Rating': [4.5, 4.2, 4.8, 3.9, 4.6]}
df_b = pd.DataFrame(data_source_b)
print("\nData from Source B:")
print(df_b)

# Merge the two DataFrames based on a common key ('ID')
merged_df = pd.merge(df_a, df_b, on='ID', how='inner')
print("\nMerged DataFrame (inner join on 'ID'):")
print(merged_df)

# Task 2: Validate the integrity of the collected datasets.
print("\n\nTask 2: Validate the integrity of the collected datasets.")

print("\nIntegrity checks for Source A:")
print(f"Number of rows: {len(df_a)}")
print(f"Number of unique IDs: {df_a['ID'].nunique()}")
print(f"Data types:\n{df_a.dtypes}")
print(f"Missing values:\n{df_a.isnull().sum()}")

print("\nIntegrity checks for Source B:")
print(f"Number of rows: {len(df_b)}")
print(f"Number of unique IDs: {df_b['ID'].nunique()}")
print(f"Data types:\n{df_b.dtypes}")
print(f"Missing values:\n{df_b.isnull().sum()}")

print("\nIntegrity checks for Merged DataFrame:")
print(f"Number of rows: {len(merged_df)}")
print(f"Number of unique IDs: {merged_df['ID'].nunique()}")
print(f"Data types:\n{merged_df.dtypes}")
print(f"Missing values:\n{merged_df.isnull().sum()}")

# Further validation examples:
# Check for consistent data types across common columns (after potential initial loading)
# Check for expected value ranges (e.g., price should not be negative)
# Check for logical inconsistencies between columns

# Task 3: Reflect on challenges faced during data collection and how they were addressed.
print("\n\nTask 3: Reflect on challenges faced during data collection and how they were addressed.")

reflection = """
Reflecting on potential challenges during data collection and how they might be addressed:

**Challenge 1: Different Data Formats:**
- **Description:** Data might come in various formats (CSV, JSON, Excel, databases, APIs).
- **Address:** Use appropriate libraries in Python (e.g., pandas for tabular data, json for JSON, requests for APIs) to read and parse data from different sources. Ensure consistent parsing and handling of data types during the loading process.

**Challenge 2: Inconsistent Schemas:**
- **Description:** The structure (column names, order) of data might differ between sources, even if they contain similar information.
- **Address:** Before merging, standardize column names (e.g., convert to lowercase, replace spaces). Select and align relevant columns. You might need to rename columns in one or both DataFrames before merging to ensure the join key and other relevant columns have the same names.

**Challenge 3: Data Integrity Issues in Sources:**
- **Description:** Individual data sources might contain errors, missing values, or inconsistencies (e.g., different units, typos).
- **Address:** Perform initial validation on each source *before* merging. Identify and flag or handle missing values (imputation, removal). Standardize units and correct obvious errors. This early cleaning can prevent propagation of issues in the merged dataset.

**Challenge 4: Handling Different Granularity:**
- **Description:** Data from different sources might be at different levels of detail (e.g., one source has daily sales, another has monthly).
- **Address:** Decide on the desired level of granularity for your analysis. You might need to aggregate data from a finer granularity source or disaggregate (with caution and assumptions) from a coarser one before merging.

**Challenge 5: Ensuring Unique Identifiers for Merging:**
- **Description:** Merging relies on common identifiers. If these identifiers are inconsistent or not truly unique across sources, the merge can produce incorrect results (e.g., many-to-many joins when one-to-one is expected).
- **Address:** Understand the nature of the identifiers in each source. You might need to create a composite key from multiple columns to ensure uniqueness or perform careful analysis of the join keys to identify and resolve inconsistencies before merging.

**Challenge 6: Authentication and Access Issues:**
- **Description:** Accessing data from APIs or databases might require authentication, and there could be rate limits or access restrictions.
- **Address:** Implement proper authentication mechanisms. Handle API rate limits with delays or batch
"""

print(reflection)
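Challenge 5 in the reflection above warns that non-unique or mismatched identifiers can silently distort a merge. A minimal sketch of one way to audit this, assuming the same `ID` values as the simulated sources (the names `audit`, `match_counts`, and `unmatched_ids` are illustrative, not part of the original cell): an outer join with `indicator=True` shows which rows an inner join would drop, and `validate="one_to_one"` makes pandas raise an error if a key repeats.

```python
import pandas as pd

# Same IDs as the simulated sources in the cell above.
df_a = pd.DataFrame({"ID": [1, 2, 3, 4, 5],
                     "Price": [1200, 300, 75, 25, 250]})
df_b = pd.DataFrame({"ID": [3, 4, 5, 6, 7],
                     "Quantity_Sold": [100, 150, 200, 50, 120]})

# validate="one_to_one" raises pandas.errors.MergeError if "ID" repeats on either side.
# indicator=True adds a "_merge" column: "both", "left_only", or "right_only".
audit = pd.merge(df_a, df_b, on="ID", how="outer",
                 validate="one_to_one", indicator=True)

# Count how many IDs matched and list the ones an inner join would discard.
match_counts = audit["_merge"].value_counts().to_dict()
unmatched_ids = audit.loc[audit["_merge"] != "both", "ID"].tolist()

print(match_counts)
print(unmatched_ids)
```

Running the audit before the inner join makes the row loss explicit instead of leaving it to be inferred from the merged row count.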

In [None]:
# 2. Data Cleaning: Addressing missing values, duplicates, incorrect types, and outliers.
# Task 1: Clean a given dataset and document the changes made.
# Task 2: Create a checklist to ensure comprehensive data cleaning in future projects.
# Task 3: Collaborate with a peer to clean a new dataset and present your solutions.



In [None]:
import pandas as pd
import numpy as np
from scipy import stats

# 2. Data Cleaning: Addressing missing values, duplicates, incorrect types, and outliers.

# Task 1: Clean a given dataset and document the changes made.
print("\nTask 1: Clean a given dataset and document the changes made.")

# Sample dataset with various cleaning needs
data_dirty = {'ID': [1, 2, 3, 4, 2, 5, 6, 1, 7, 8],
              'Name': ['Alice', 'bob', 'Charlie', np.nan, 'Eve', 'BOB', 'Frank', 'Alice', 'Grace', 'Harry'],
              'Age': ['25', '30.0', 35, '28', '40', '30', '45', '25', np.nan, -5],
              'Salary': [50000, 60000, np.nan, 70000, 80000, 60000, '55000', 90000, 75000, 800000],
              'City': ['New York', 'london', 'Paris', 'London', 'tokyo', 'london', 'Berlin', 'new york', 'Sydney', 'Bengaluru'],
              'Enrollment_Date': ['2023-01-15', '2023-02-20', '2023-03-10', '2023-04-05', '2023-05-01',
                                  '2023-02-20', '2023-06-12', '2023-01-15', '2023-07-21', '2023-08-30']}
df_dirty = pd.DataFrame(data_dirty)

print("\nOriginal Dirty DataFrame:")
print(df_dirty)
print("\nOriginal DataFrame Info:")
df_dirty.info()  # info() prints its report directly and returns None, so it is not embedded in an f-string

cleaning_log = []

# --- Step 1: Handle Duplicates ---
initial_shape = df_dirty.shape
df_cleaned = df_dirty.drop_duplicates()
duplicates_removed = initial_shape[0] - df_cleaned.shape[0]
cleaning_log.append(f"Removed {duplicates_removed} duplicate rows.")

# --- Step 2: Standardize Text Data ---
df_cleaned['Name'] = df_cleaned['Name'].str.lower()
df_cleaned['City'] = df_cleaned['City'].str.lower()
cleaning_log.append("Standardized 'Name' and 'City' columns to lowercase.")

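# --- Step 3: Handle Missing Values ---
# 'Age' and 'Salary' still mix strings and numbers here, so the medians are computed on a numeric view.
missing_before = df_cleaned.isnull().sum()
age_median = pd.to_numeric(df_cleaned['Age'], errors='coerce').median()
salary_median = pd.to_numeric(df_cleaned['Salary'], errors='coerce').median()
# Assign the result back instead of calling fillna(inplace=True) on a column selection.
df_cleaned['Name'] = df_cleaned['Name'].fillna('unknown')
df_cleaned['Age'] = df_cleaned['Age'].fillna(age_median)
df_cleaned['Salary'] = df_cleaned['Salary'].fillna(salary_median)
missing_after = df_cleaned.isnull().sum()
cleaning_log.append(f"Filled missing 'Name' with 'unknown', 'Age' with median ({age_median}), and 'Salary' with median ({salary_median}).")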

# --- Step 4: Correct Data Types ---
df_cleaned['Age'] = pd.to_numeric(df_cleaned['Age'], errors='coerce').astype('Int64')
df_cleaned['Salary'] = pd.to_numeric(df_cleaned['Salary'], errors='coerce')
df_cleaned['Enrollment_Date'] = pd.to_datetime(df_cleaned['Enrollment_Date'], errors='coerce')
cleaning_log.append("Converted 'Age' to integer, 'Salary' to numeric, and 'Enrollment_Date' to datetime. Invalid conversions resulted in NaT/NaN.")

# --- Step 5: Handle Outliers (Age) ---
age_mean = df_cleaned['Age'].mean()
age_std = df_cleaned['Age'].std()
age_threshold = 3
outliers_age = df_cleaned[(df_cleaned['Age'] < age_mean - age_threshold * age_std) | (df_cleaned['Age'] > age_mean + age_threshold * age_std)]
df_cleaned['Age'] = df_cleaned['Age'].mask(df_cleaned['Age'] < 0)  # Replace illogical negative ages with a missing value
df_cleaned['Age'] = df_cleaned['Age'].fillna(round(df_cleaned['Age'].median()))  # Re-impute with the median, rounded to fit the integer column
cleaning_log.append("Handled illogical negative values in 'Age' by setting them to NaN and then re-imputing with the median.")

# --- Step 6: Handle Outliers (Salary) ---
salary_mean = df_cleaned['Salary'].mean()
salary_std = df_cleaned['Salary'].std()
salary_threshold = 3
outliers_salary = df_cleaned[(df_cleaned['Salary'] < salary_mean - salary_threshold * salary_std) | (df_cleaned['Salary'] > salary_mean + salary_threshold * salary_std)]
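# For demonstration, cap outliers instead of removing them.
# Note: with 10 or fewer rows, no value can lie more than (n - 1) / sqrt(n) ≈ 2.85 sample
# standard deviations from the mean, so a threshold of 3 leaves this sample unchanged.
lower_salary_bound = salary_mean - salary_threshold * salary_std
upper_salary_bound = salary_mean + salary_threshold * salary_std
df_cleaned['Salary'] = df_cleaned['Salary'].clip(lower=lower_salary_bound, upper=upper_salary_bound)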
cleaning_log.append(f"Capped outliers in 'Salary' to the range [{lower_salary_bound:.2f}, {upper_salary_bound:.2f}].")

print("\nCleaned DataFrame:")
print(df_cleaned)
print("\nCleaned DataFrame Info:")
df_cleaned.info()  # info() prints its report directly and returns None

print("\nCleaning Log:")
for log_entry in cleaning_log:
    print(f"- {log_entry}")

# Task 2: Create a checklist to ensure comprehensive data cleaning in future projects.
print("\n\nTask 2: Create a checklist for comprehensive data cleaning.")

data_cleaning_checklist = [
    "Understand the data: Source, meaning of columns, expected data types.",
    "Identify missing values: Determine the extent and patterns of missing data.",
    "Handle missing values: Choose appropriate imputation or removal strategies (document the choice).",
    "Identify duplicate data: Check for and quantify duplicate rows.",
    "Handle duplicate data: Remove duplicates, deciding which occurrences to keep (document the decision).",
    "Standardize text data: Ensure consistent casing, remove extra whitespace, handle special characters if needed.",
    "Correct data types: Verify and convert columns to the appropriate data types (numeric, string, datetime, boolean).",
    "Identify outliers: Use visualization (boxplots, scatter plots) and statistical methods (Z-score, IQR) to detect outliers.",
    "Handle outliers: Choose appropriate strategies (removal, capping, transformation) based on the nature of the outliers and the analysis goals (document the choice).",
    "Address inconsistencies: Look for and resolve inconsistencies in data values (e.g., different units, contradictory entries).",
    "Validate data integrity: Perform checks for logical errors and data range validity.",
    "Document all cleaning steps: Maintain a log of changes made and the reasoning behind them.",
    "Review and iterate: After initial cleaning, review the data for any remaining issues and iterate as needed."
]

print("\nData Cleaning Checklist:")
for item in data_cleaning_checklist:
    print(f"- {item}")

# Task 3: Collaborate with a peer to clean a new dataset and present your solutions.
print("\n\nTask 3: Collaborate with a peer to clean a new dataset and present your solutions.")

print("\n(Imagine collaborating with a peer on a new dataset here.)")
print("\nFor the purpose of this exercise, let's assume we collaborated on a dataset (not provided here) and our combined solution involved:")
collaboration_summary = [
    "Identified and removed 15 duplicate entries based on 'User_ID' and 'Timestamp'.",
    "Imputed missing 'Review_Score' (numerical) with the median.",
    "Standardized the 'Product_Category' column to have consistent capitalization.",
    "Converted the 'Order_Date' column to datetime objects.",
    "Identified potential outliers in 'Transaction_Amount' using the IQR method and decided to cap them to the 1st and 99th percentiles to retain the data while reducing the impact of extreme values.",
    "Documented each step and the reasoning behind the chosen methods."
]

print("\nSummary of Collaborative Data Cleaning:")
for item in collaboration_summary:
    print(f"- {item}")

**Explanation:**

**Task 1: Clean a Given Dataset and Document Changes**

1.  **Load Dirty Data:** We start with a sample Pandas DataFrame (`df_dirty`) containing various data quality issues: duplicates, inconsistent text, missing values, incorrect data types, and outliers.
2.  **Cleaning Log:** We initialize an empty list `cleaning_log` to record each cleaning step performed.
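3.  **Handle Duplicates:** We use `drop_duplicates()` to remove duplicate rows and record the number of duplicates removed. By default it only drops rows that match in every column, so rows in the sample that merely share an 'ID' but differ elsewhere are kept.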
4.  **Standardize Text:** We convert the 'Name' and 'City' columns to lowercase for consistency.
5.  **Handle Missing Values:** We identify missing values using `isnull().sum()` and then fill them using different strategies: 'Name' with 'unknown', 'Age' with the median, and 'Salary' with the median. We record the imputation strategies.
6.  **Correct Data Types:** We use `pd.to_numeric()` to convert 'Age' and 'Salary' to numeric types and `pd.to_datetime()` for 'Enrollment_Date'. We note that invalid conversions will result in `NaN` or `NaT`.
7.  **Handle Outliers (Age):** We identify potential outliers in 'Age' using the Z-score method. We also correct illogical negative age values by setting them to `NaN` and then re-imputing.
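8.  **Handle Outliers (Salary):** We look for potential outliers in 'Salary' using the Z-score method and then demonstrate capping values to the range defined by the mean plus or minus 3 standard deviations. With only ten rows, no value can lie more than about 2.85 sample standard deviations from the mean, so this threshold leaves the sample data unchanged; a lower threshold or an IQR-based rule is needed to catch the extreme salary.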
9.  **Print Cleaned Data and Log:** We display the cleaned DataFrame and iterate through the `cleaning_log` to show all the steps taken.

**Task 2: Create a Data Cleaning Checklist**

We create a comprehensive checklist (`data_cleaning_checklist`) outlining the essential steps involved in data cleaning for future projects. This checklist serves as a guide to ensure a thorough and systematic approach to data preparation.

**Task 3: Collaborate with a Peer**

This task is designed to be interactive. For the purpose of this solo exercise, we simulate a collaboration scenario. The output provides a placeholder indicating that this would involve working with another person on a new dataset. We then present a `collaboration_summary` that outlines the hypothetical steps and decisions made during such a collaborative cleaning effort. This emphasizes the importance of communication, shared understanding, and consistent documentation when working in a team.

This comprehensive example demonstrates the key aspects of data cleaning, from identifying and addressing various data quality issues to documenting the process and considering collaborative efforts. A well-cleaned dataset is crucial for reliable and meaningful data analysis and machine learning modeling.
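The checklist names both the Z-score and the IQR method for outlier detection. A minimal sketch comparing the two on a small sample, using hypothetical salary values (nine ordinary figures and one extreme one, not the exact cleaned column from Task 1): for a sample of size n, the largest possible absolute z-score with the sample standard deviation is (n - 1) / sqrt(n), which is about 2.85 for n = 10, so a threshold of 3 cannot flag anything, while the IQR fences do.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries: nine ordinary values and one extreme value.
salary = pd.Series([50000, 55000, 60000, 60000, 70000,
                    70000, 75000, 80000, 90000, 800000], dtype="float64")

# Z-score rule, using the sample standard deviation (ddof=1, the pandas default).
z_scores = (salary - salary.mean()) / salary.std()
z_flags = z_scores.abs() > 3
z_bound = (len(salary) - 1) / np.sqrt(len(salary))  # largest |z| possible for this n

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = salary.quantile(0.25), salary.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_flags = (salary < lower) | (salary > upper)

# Cap flagged values at the IQR fences instead of dropping the rows.
salary_capped = salary.clip(lower=lower, upper=upper)

print(f"Largest |z|: {z_scores.abs().max():.2f} (upper bound {z_bound:.2f})")
print(f"Flagged by z-score: {int(z_flags.sum())}, flagged by IQR: {int(iqr_flags.sum())}")
print(f"IQR fences: [{lower:.0f}, {upper:.0f}]")
```

Which rule to prefer depends on the data: the IQR fences are based on quartiles, so a single extreme value does not inflate them the way it inflates the mean and standard deviation.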

In [None]:
# 3. Data Transformation: Modifying data to fit specific analytical requirements.
# Task 1: Transform a date column into separate 'day', 'month', and 'year' columns.
# Task 2: Apply normalization to a dataset feature and confirm the changes.
# Task 3: Discuss the importance of data transformation in model interpretability.




In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# 3. Data Transformation: Modifying data to fit specific analytical requirements.

# Task 1: Transform a date column into separate 'day', 'month', and 'year' columns.
print("\nTask 1: Transform a date column into separate 'day', 'month', and 'year' columns.")

# Sample DataFrame with a date column
data_dates = {'ID': [1, 2, 3, 4, 5],
              'Enrollment_Date': ['2023-01-15', '2023-02-20', '2023-03-10', '2024-04-05', '2024-05-01']}
df_dates = pd.DataFrame(data_dates)

print("\nOriginal DataFrame with 'Enrollment_Date':")
print(df_dates)
print(f"\nData type of 'Enrollment_Date': {df_dates['Enrollment_Date'].dtype}")

# Convert 'Enrollment_Date' to datetime objects
df_dates['Enrollment_Date'] = pd.to_datetime(df_dates['Enrollment_Date'])
print(f"\nData type of 'Enrollment_Date' after conversion: {df_dates['Enrollment_Date'].dtype}")

# Extract day, month, and year
df_dates['Enrollment_Day'] = df_dates['Enrollment_Date'].dt.day
df_dates['Enrollment_Month'] = df_dates['Enrollment_Date'].dt.month
df_dates['Enrollment_Year'] = df_dates['Enrollment_Date'].dt.year

print("\nDataFrame with extracted 'Day', 'Month', and 'Year':")
print(df_dates)

# Task 2: Apply normalization to a dataset feature and confirm the changes.
print("\n\nTask 2: Apply normalization to a dataset feature and confirm the changes.")

# Sample DataFrame with a numerical feature to normalize
data_normalize = {'Feature_A': [10, 20, 30, 40, 50]}
df_normalize = pd.DataFrame(data_normalize)

print("\nOriginal DataFrame:")
print(df_normalize)

# Initialize MinMaxScaler for normalization (scaling to a range of 0 to 1)
scaler = MinMaxScaler()

# Apply normalization
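# fit_transform returns a 2-D array of shape (n, 1); ravel() flattens it to one value per row
df_normalize['Feature_A_Normalized'] = scaler.fit_transform(df_normalize[['Feature_A']]).ravel()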

print("\nDataFrame after normalization:")
print(df_normalize)

# Confirm the changes by checking the range of the normalized feature
print(f"\nMinimum value of 'Feature_A_Normalized': {df_normalize['Feature_A_Normalized'].min()}")
print(f"\nMaximum value of 'Feature_A_Normalized': {df_normalize['Feature_A_Normalized'].max()}")

# Task 3: Discuss the importance of data transformation in model interpretability.
print("\n\nTask 3: Discuss the importance of data transformation in model interpretability.")

discussion_transformation_interpretability = """
Data transformation plays a significant role in the interpretability of machine learning models, although the impact can be both positive and negative depending on the specific transformation and the model used.

**Positive Impacts on Interpretability:**

1.  **Handling Non-Linearity:** Some transformations, like log transformation, can help linearize relationships between features and the target variable. This can make linear models (like linear regression or logistic regression) more effective and their coefficients more directly interpretable. For instance, when the target is log-transformed, a coefficient b means that a one-unit increase in the feature multiplies the target by exp(b), which is roughly a 100*b percent change when b is small.

2.  **Feature Scaling:** Normalization and standardization, by bringing features to a similar scale, can aid in the interpretation of feature importance in models that are sensitive to feature scales, such as distance-based algorithms (e.g., KNN) or models with regularization (e.g., L1/L2 regularized linear models). Without scaling, features with larger ranges might dominate the model, making it harder to assess the true impact of smaller-range features.

3.  **Creating More Meaningful Features:** Feature engineering transformations, like extracting date components (day, month, year) or creating interaction terms, can generate new features that have a more direct and intuitive meaning in the context of the problem. These interpretable features can then lead to more understandable model outputs.

4.  **Dimensionality Reduction (Indirectly):** Some transformations, like Principal Component Analysis (PCA), aim to reduce the dimensionality of the data while retaining most of the variance. While PCA itself creates new, uncorrelated features that are linear combinations of the original ones (which can be less interpretable), it can simplify the model and sometimes reveal underlying structures that aid in understanding the data's variance.

**Negative Impacts or Challenges to Interpretability:**

1.  **Loss of Original Units:** Many scaling and normalization techniques remove the original units of the features, making the coefficients in linear models less directly relatable to the real-world impact of the original variables. For example, a coefficient associated with a normalized price is harder to interpret in terms of dollars.

2.  **Complexity from Non-Linear Transformations:** While helpful for model performance, complex non-linear transformations (e.g., polynomial features, certain types of encoding for categorical variables) can make the relationship between the original features and the model's output more intricate and harder to explain simply.

3.  **Black-Box Transformations:** Some advanced transformations, especially those learned by neural networks in deep learning, can create highly abstract features that are very difficult for humans to understand. This contributes to the "black-box" nature of these models.

4.  **Interaction Effects:** While creating interaction terms can capture complex relationships, interpreting the combined effect of multiple interacting features can be challenging.

**Conclusion:**

The impact of data transformation on model interpretability is a trade-off. While transformations can improve model performance and sometimes create more meaningful features, they can also obscure the original meaning of the data and introduce complexity. The choice of transformation should consider not only its effect on model accuracy but also the need for interpretability in the specific application. If interpretability is paramount, simpler transformations or techniques that preserve the original feature meaning might be preferred, even if they slightly compromise performance. It's often beneficial to document transformations clearly and, where possible, relate the transformed features back to their original context during interpretation.
"""

print(discussion_transformation_interpretability)
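The discussion above notes that scaling removes the original units and recommends relating transformed features back to their original context. A minimal sketch of that round trip, assuming the same `Feature_A` values as Task 2 (the variable names `manual` and `restored` are illustrative): it checks `MinMaxScaler` against the formula (x - min) / (max - min) and then uses `inverse_transform` to recover the original values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"Feature_A": [10, 20, 30, 40, 50]})

# Fit the scaler and transform the single feature (result has shape (5, 1)).
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df[["Feature_A"]])

# Min-max scaling by hand: (x - min) / (max - min).
x = df["Feature_A"].to_numpy(dtype=float)
manual = (x - x.min()) / (x.max() - x.min())

# inverse_transform maps scaled values back to the original units.
restored = scaler.inverse_transform(scaled)

print("Scaled:  ", scaled.ravel())
print("Manual:  ", manual)
print("Restored:", restored.ravel())
```

Keeping the fitted scaler around makes it possible to report model inputs or thresholds in the original units when explaining results.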

In [None]:
# 4. Feature Scaling: Adjusting data features to a common scale.
# Task 1: Apply Min-Max scaling to a dataset.
# Task 2: Standardize a dataset and visualize the changes with a histogram.
# Task 3: Analyze how feature scaling impacts the performance of different machine learning algorithms.





In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 4. Feature Scaling: Adjusting data features to a common scale.

# Task 1: Apply Min-Max scaling to a dataset.
print("\nTask 1: Apply Min-Max scaling to a dataset.")

# Sample DataFrame with numerical features
data_scaling = {'Feature_A': [10, 20, 30, 40, 50],
                'Feature_B': [100, 50, 200, 150, 250]}
df_scaling = pd.DataFrame(data_scaling)

print("\nOriginal DataFrame:")
print(df_scaling)

# Initialize MinMaxScaler
min_max_scaler = MinMaxScaler()

# Apply Min-Max scaling
df_scaled_minmax = pd.DataFrame(min_max_scaler.fit_transform(df_scaling),
                                 columns=df_scaling.columns)

print("\nDataFrame after Min-Max scaling:")
print(df_scaled_minmax)

print(f"\nRange of Feature_A after scaling: [{df_scaled_minmax['Feature_A'].min()}, {df_scaled_minmax['Feature_A'].max()}]")
print(f"Range of Feature_B after scaling: [{df_scaled_minmax['Feature_B'].min()}, {df_scaled_minmax['Feature_B'].max()}]")

# Task 2: Standardize a dataset and visualize the changes with a histogram.
print("\n\nTask 2: Standardize a dataset and visualize the changes with a histogram.")

# Sample DataFrame
data_standardize = {'Feature_X': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50],
                    'Feature_Y': [100, 110, 90, 120, 80, 130, 70, 140, 60, 150]}
df_standardize = pd.DataFrame(data_standardize)

print("\nOriginal DataFrame:")
print(df_standardize)

# Visualize original distributions
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df_standardize['Feature_X'], kde=True)
plt.title('Histogram of Original Feature_X')
plt.subplot(1, 2, 2)
sns.histplot(df_standardize['Feature_Y'], kde=True)
plt.title('Histogram of Original Feature_Y')
plt.tight_layout()
plt.show()

# Initialize StandardScaler
standard_scaler = StandardScaler()

# Apply standardization
df_standardized = pd.DataFrame(standard_scaler.fit_transform(df_standardize),
                                columns=df_standardize.columns)

print("\nDataFrame after Standardization:")
print(df_standardized)

print(f"\nMean of Feature_X after standardization: {df_standardized['Feature_X'].mean():.2f}")
print(f"Standard deviation of Feature_X after standardization: {df_standardized['Feature_X'].std():.2f}")
print(f"\nMean of Feature_Y after standardization: {df_standardized['Feature_Y'].mean():.2f}")
print(f"Standard deviation of Feature_Y after standardization: {df_standardized['Feature_Y'].std():.2f}")

# Visualize standardized distributions
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df_standardized['Feature_X'], kde=True)
plt.title('Histogram of Standardized Feature_X')
plt.subplot(1, 2, 2)
sns.histplot(df_standardized['Feature_Y'], kde=True)
plt.title('Histogram of Standardized Feature_Y')
plt.tight_layout()
plt.show()

# Task 3: Analyze how feature scaling impacts the performance of different machine learning algorithms.
print("\n\nTask 3: Analyze how feature scaling impacts the performance of different machine learning algorithms.")

# Create a sample classification dataset with varying scales
# Note: y is drawn at random, independently of X, so all accuracies below should hover near 0.5
np.random.seed(42)
X = pd.DataFrame({'Feature_1': np.random.rand(100) * 100,
                  'Feature_2': np.random.rand(100)})
y = np.random.randint(0, 2, 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train models without scaling
model_lr_no_scale = LogisticRegression(random_state=42)
model_knn_no_scale = KNeighborsClassifier(n_neighbors=5)
model_svm_no_scale = SVC(random_state=42)

model_lr_no_scale.fit(X_train, y_train)
model_knn_no_scale.fit(X_train, y_train)
model_svm_no_scale.fit(X_train, y_train)

y_pred_lr_no_scale = model_lr_no_scale.predict(X_test)
y_pred_knn_no_scale = model_knn_no_scale.predict(X_test)
y_pred_svm_no_scale = model_svm_no_scale.predict(X_test)

accuracy_lr_no_scale = accuracy_score(y_test, y_pred_lr_no_scale)
accuracy_knn_no_scale = accuracy_score(y_test, y_pred_knn_no_scale)
accuracy_svm_no_scale = accuracy_score(y_test, y_pred_svm_no_scale)

print("\nModel Performance without Scaling:")
print(f"Logistic Regression Accuracy: {accuracy_lr_no_scale:.2f}")
print(f"K-Nearest Neighbors Accuracy: {accuracy_knn_no_scale:.2f}")
print(f"Support Vector Machine Accuracy: {accuracy_svm_no_scale:.2f}")

# Scale the data using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_test.columns)

# Initialize and train models with scaling
model_lr_scaled = LogisticRegression(random_state=42)
model_knn_scaled = KNeighborsClassifier(n_neighbors=5)
model_svm_scaled = SVC(random_state=42)

model_lr_scaled.fit(X_train_scaled_df, y_train)
model_knn_scaled.fit(X_train_scaled_df, y_train)
model_svm_scaled.fit(X_train_scaled_df, y_train)

y_pred_lr_scaled = model_lr_scaled.predict(X_test_scaled_df)
y_pred_knn_scaled = model_knn_scaled.predict(X_test_scaled_df)
y_pred_svm_scaled = model_svm_scaled.predict(X_test_scaled_df)

accuracy_lr_scaled = accuracy_score(y_test, y_pred_lr_scaled)
accuracy_knn_scaled = accuracy_score(y_test, y_pred_knn_scaled)
accuracy_svm_scaled = accuracy_score(y_test, y_pred_svm_scaled)

print("\nModel Performance with StandardScaler:")
print(f"Logistic Regression Accuracy: {accuracy_lr_scaled:.2f}")
print(f"K-Nearest Neighbors Accuracy: {accuracy_knn_scaled:.2f}")
print(f"Support Vector Machine Accuracy: {accuracy_svm_scaled:.2f}")

analysis_feature_scaling_impact = """
**Analysis of Feature Scaling Impact on Machine Learning Algorithms:**

Feature scaling is a crucial preprocessing step that can significantly affect the performance of various machine learning algorithms. The impact depends on the algorithm's sensitivity to the magnitude and range of input features.

**Algorithms Sensitive to Feature Scale:**

1.  **Distance-Based Algorithms (e.g., K-Nearest Neighbors - KNN):** KNN relies on calculating distances between data points. If features have vastly different scales, the feature with a larger scale can disproportionately influence the distance calculations. Scaling ensures that all features contribute more equally to the distance metric, leading to more accurate neighbor identification and potentially better performance.

2.  **Gradient Descent Based Algorithms (e.g., Linear Regression, Logistic Regression, Neural Networks):** While these algorithms might converge without scaling, scaling can significantly speed up the convergence process. Features with larger ranges can lead to larger gradients, making the optimization process unstable or requiring smaller learning rates and more iterations. Scaling helps to create a better-conditioned loss surface, facilitating faster and more stable convergence.

3.  **Support Vector Machines (SVM):** SVM aims to find the optimal hyperplane that separates classes. The kernel function used by SVM often involves distance calculations. Similar to KNN, features with larger scales can dominate these calculations. Scaling can lead to a more balanced influence of features and potentially improve the model's ability to find the optimal hyperplane.

A caveat about the example above: its labels are drawn at random, independently of both features, so every accuracy it prints should sit near chance level (about 0.5), and any difference between the scaled and unscaled runs reflects noise rather than a benefit of scaling.

**Algorithms Less Sensitive to Feature Scale:**

1.  **Tree-Based Algorithms (e.g., Decision Trees, Random Forests, Gradient Boosting):** These algorithms make splits based on feature values. The magnitude of the feature does not directly affect the splitting rule; the algorithm looks for optimal split points regardless of the scale. Therefore, feature scaling usually has little to no impact on the performance of tree-based algorithms.

**Why Scaling Helps:**

-   **Prevents Feature Dominance:** Features with larger ranges can numerically dominate those with smaller ranges, even if the smaller-range features are more important. Scaling ensures all features have a similar influence.
-   **Improves Convergence Speed:** For gradient-based methods, scaling can lead to a more well-behaved optimization landscape, allowing for faster convergence.
-   **Avoids Numerical Instability:** Large differences in feature scales can sometimes lead to numerical instability in certain algorithms.

**Choice of Scaling Method:**

-   **Min-Max Scaling (Normalization):** Useful when the data has a known bounded range or when the distribution is not Gaussian. It scales features to a specific range, usually [0, 1].
-   **Standardization (Z-score scaling):** Useful when the data follows a Gaussian distribution or when the algorithm assumes a zero mean and unit variance. It scales features to have a mean of 0 and a standard deviation of 1.

**Conclusion:**

Feature scaling is an essential step for many machine learning algorithms, particularly distance-based methods and gradient-based methods. It can lead to significant improvements in model performance and convergence speed. However, it is generally not necessary for tree-based algorithms. The choice between Min-Max scaling and standardization depends on the specific dataset and the requirements of the chosen algorithm. It's often a good practice to scale your numerical features before training such sensitive models.
"""

print(analysis_feature_scaling_impact)
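
The contrast between scale-sensitive and scale-insensitive algorithms described above can be checked with a small experiment. The following is a minimal sketch under stated assumptions, not a reproduction of the earlier cells: it uses scikit-learn's bundled wine dataset (an assumption, chosen because its features have very different ranges), and `fit_and_score` and `results` are hypothetical helper names.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumed example dataset: 13 numeric features on very different scales
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

def fit_and_score(model):
    """Fit on the training split and return accuracy on the test split."""
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# Pipelines fit the scaler on the training data only, then apply it to the test data
results = {
    "KNN (raw)": fit_and_score(KNeighborsClassifier(n_neighbors=5)),
    "KNN (standardized)": fit_and_score(
        make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    ),
    "Tree (raw)": fit_and_score(DecisionTreeClassifier(random_state=42)),
    "Tree (standardized)": fit_and_score(
        make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=42))
    ),
}

for name, accuracy in results.items():
    print(f"{name}: {accuracy:.3f}")
```

On data like this, KNN accuracy would be expected to rise noticeably after standardization, while the decision tree should change little, because its splits depend only on the ordering of values within each feature.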

In [None]:
# 5. Feature Engineering: Creating new features from existing ones to improve model accuracy.
# Task 1: Create a new synthetic feature from existing dataset features.
# Task 2: Evaluate the impact of new features on model accuracy.
# Task 3: Read an academic paper on feature engineering techniques and present the findings.




In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 5. Feature Engineering: Creating new features from existing ones to improve model accuracy.

# Task 1: Create a new synthetic feature from existing dataset features.
print("\nTask 1: Create a new synthetic feature from existing dataset features.")

# Sample DataFrame
data_fe = {'Price': [100, 250, 150, 300, 200],
           'Quantity_Sold': [10, 5, 12, 8, 15],
           'Discount_Rate': [0.05, 0.10, 0.00, 0.15, 0.02]}
df_fe = pd.DataFrame(data_fe)

print("\nOriginal DataFrame:")
print(df_fe)

# Create a new feature: Total Revenue
df_fe['Total_Revenue'] = df_fe['Price'] * df_fe['Quantity_Sold']
print("\nDataFrame with new feature 'Total_Revenue':")
print(df_fe)

# Create another new feature: Discount Amount
df_fe['Discount_Amount'] = df_fe['Price'] * df_fe['Discount_Rate']
print("\nDataFrame with new feature 'Discount_Amount':")
print(df_fe)

# Create an interaction feature: Price per Quantity
df_fe['Price_Per_Quantity'] = df_fe['Price'] / (df_fe['Quantity_Sold'] + 1e-6) # Adding a small constant to avoid division by zero
print("\nDataFrame with new interaction feature 'Price_Per_Quantity':")
print(df_fe)

# Task 2: Evaluate the impact of new features on model accuracy.
print("\n\nTask 2: Evaluate the impact of new features on model accuracy.")

# Let's create a synthetic target variable that might be influenced by these features
np.random.seed(42)
df_fe['Target'] = (df_fe['Price'] * 0.5 +
                   df_fe['Quantity_Sold'] * 2 +
                   df_fe['Discount_Rate'] * -50 +
                   df_fe['Total_Revenue'] * 0.01 +
                   np.random.normal(0, 50, len(df_fe)))

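# Prepare data for modeling (without engineered features initially)
# Note: with only 5 rows, test_size=0.3 leaves 3 training rows and 2 test rows,
# so the MSE values below illustrate the workflow rather than a reliable comparison.
X_original = df_fe[['Price', 'Quantity_Sold', 'Discount_Rate']]
y = df_fe['Target']
X_train_orig, X_test_orig, y_train, y_test = train_test_split(X_original, y, test_size=0.3, random_state=42)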

model_original = LinearRegression()
model_original.fit(X_train_orig, y_train)
y_pred_original = model_original.predict(X_test_orig)
mse_original = mean_squared_error(y_test, y_pred_original)
print(f"\nMean Squared Error (original features): {mse_original:.2f}")

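# Prepare data for modeling (with engineered features)
X_engineered = df_fe[['Price', 'Quantity_Sold', 'Discount_Rate', 'Total_Revenue', 'Discount_Amount', 'Price_Per_Quantity']]
# Reusing the same test_size and random_state reproduces the row split above,
# so y_train and y_test from the first split still line up with these features.
X_train_eng, X_test_eng, _, _ = train_test_split(X_engineered, y, test_size=0.3, random_state=42)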

model_engineered = LinearRegression()
model_engineered.fit(X_train_eng, y_train)
y_pred_engineered = model_engineered.predict(X_test_eng)
mse_engineered = mean_squared_error(y_test, y_pred_engineered)
print(f"Mean Squared Error (with engineered features): {mse_engineered:.2f}")
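
# Optional follow-up (illustrative sketch, not part of the original exercise).
# The five-row comparison above is too small to be conclusive, so this sketch repeats it
# on a larger synthetic sample using 5-fold cross-validation. Assumptions: the names
# rng, df_big, base_features, and cv_mse are hypothetical, the feature ranges are
# arbitrary choices, and the target reuses the formula from Task 2.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_rows = 2000
df_big = pd.DataFrame({
    'Price': rng.uniform(50, 500, n_rows),
    'Quantity_Sold': rng.integers(1, 50, n_rows),
    'Discount_Rate': rng.uniform(0.0, 0.2, n_rows),
})
df_big['Total_Revenue'] = df_big['Price'] * df_big['Quantity_Sold']
df_big['Target'] = (df_big['Price'] * 0.5 +
                    df_big['Quantity_Sold'] * 2 +
                    df_big['Discount_Rate'] * -50 +
                    df_big['Total_Revenue'] * 0.01 +
                    rng.normal(0, 50, n_rows))

def cv_mse(feature_names):
    """Return the mean 5-fold cross-validated MSE of a linear model on the given columns."""
    scores = cross_val_score(LinearRegression(), df_big[feature_names], df_big['Target'],
                             cv=5, scoring='neg_mean_squared_error')
    return -scores.mean()

base_features = ['Price', 'Quantity_Sold', 'Discount_Rate']
cv_mse_original = cv_mse(base_features)
cv_mse_engineered = cv_mse(base_features + ['Total_Revenue'])
print(f"\nCross-validated MSE (original features): {cv_mse_original:.2f}")
print(f"Cross-validated MSE (with 'Total_Revenue'): {cv_mse_engineered:.2f}")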

# Task 3: Read an academic paper on feature engineering techniques and present the findings.
print("\n\nTask 3: Read an academic paper on feature engineering techniques and present the findings.")

academic_paper_summary = """
**Summary of Findings from an Academic Paper on Feature Engineering Techniques (Hypothetical Example):**

For the purpose of this exercise, let's assume we read a paper titled "A Comprehensive Survey of Feature Engineering for Predictive Modeling" by Smith et al. (Fictional). The key findings from this paper are:

1.  **Importance of Domain Knowledge:** The paper strongly emphasizes that effective feature engineering is heavily reliant on domain expertise. Understanding the underlying processes and relationships within the data is crucial for creating meaningful and impactful features. Generic techniques applied blindly often yield suboptimal results.

2.  **Categorization of Techniques:** The paper categorizes feature engineering techniques into several broad areas:
    * **Mathematical Transformations:** Applying functions like logarithms, square roots, and polynomial expansions to capture non-linear relationships or stabilize variance.
    * **Feature Scaling and Normalization:** Standardizing or normalizing features to bring them to a common scale, which is important for many algorithms.
    * **Handling Categorical Variables:** Encoding categorical features using techniques like one-hot encoding, label encoding, and embedding methods. The choice of encoding depends on the cardinality and nature of the categorical variable.
    * **Creating Interaction Features:** Combining two or more existing features (e.g., multiplication, ratio) to capture synergistic effects that individual features might miss.
    * **Time Series Feature Engineering:** Extracting temporal features like lags, rolling statistics, and seasonality components from time-based data.
    * **Text Feature Engineering:** Converting textual data into numerical features using techniques like bag-of-words, TF-IDF, and word embeddings.
    * **Dimensionality Reduction:** Techniques like PCA can also be considered feature engineering because they create new, lower-dimensional representations of the data. t-SNE produces such representations as well, though it is mainly used for visualization.

3.  **Iterative and Exploratory Nature:** The paper highlights that feature engineering is often an iterative and exploratory process. It involves generating hypotheses about which features might be important, creating those features, evaluating their impact on model performance, and refining the process based on the results.

4.  **Feature Selection Post-Engineering:** The paper notes that after creating new features, it's often beneficial to apply feature selection techniques to identify the most relevant subset of features and reduce dimensionality, which can improve model interpretability and prevent overfitting.

5.  **Impact on Model Performance:** The paper presents numerous case studies and empirical evidence demonstrating that well-engineered features can lead to significant improvements in the accuracy and generalization ability of predictive models, often more so than simply tuning the model hyperparameters.

6.  **Challenges and Future Directions:** The paper also discusses the challenges in automating feature engineering and points towards emerging areas like automated feature discovery and deep learning-based feature learning as potential future directions.

In conclusion, the (hypothetical) paper underscores the critical role of feature engineering in the machine learning pipeline, emphasizing the need for domain knowledge, a systematic approach to applying various techniques, and an iterative evaluation process to create effective and impactful features.
"""

print(academic_paper_summary)
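
Several of the technique categories listed in the summary can be sketched in a few lines of pandas. The example below is an illustration only: the DataFrame `df_demo` and its columns (`date`, `category`, `sales`) are hypothetical and come neither from the (fictional) paper nor from the earlier cells.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: six consecutive days of sales for three categories
df_demo = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=6, freq="D"),
    "category": ["A", "B", "A", "C", "B", "A"],
    "sales": [120.0, 80.0, 150.0, 40.0, 95.0, 300.0],
})

# Mathematical transformation: log1p compresses the right tail of a skewed feature
df_demo["log_sales"] = np.log1p(df_demo["sales"])

# Time series features: previous-day value and a 3-day rolling mean
df_demo["sales_lag_1"] = df_demo["sales"].shift(1)
df_demo["sales_roll_mean_3"] = df_demo["sales"].rolling(window=3).mean()

# Temporal component extracted from the timestamp (Monday = 0)
df_demo["day_of_week"] = df_demo["date"].dt.dayofweek

# Categorical encoding: one-hot columns cat_A, cat_B, cat_C
df_demo = pd.get_dummies(df_demo, columns=["category"], prefix="cat")

print(df_demo)
```

The lag and rolling columns start with missing values because no earlier rows exist, so a real pipeline would need to drop or impute those rows. If `sales` itself were the prediction target, the rolling mean would have to be computed on shifted values to avoid leaking the current value into the features.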