Question 1: What is the difference between AI, ML, DL, and Data Science? Provide a brief explanation of each.

Artificial Intelligence (AI) is the broadest field, representing the theory and development of computer systems capable of performing tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation. Its scope is to create intelligent agents that can reason, learn, and act autonomously.

Machine Learning (ML) is a subset of AI. It is a technique that enables systems to automatically learn and improve from experience without being explicitly programmed. The core technique involves developing algorithms that can be trained on data to make predictions or decisions.

Deep Learning (DL) is a specialized sub-field of ML. It employs neural networks with multiple hidden layers (hence, "deep") to model complex patterns in data. DL techniques, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), are primarily used in complex tasks like image recognition and natural language processing.

Finally, Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. Its scope is highly application-focused, blending statistics, computer science, and domain expertise to solve real-world problems, often utilizing ML and DL techniques as tools in the process.


Question 2: Explain overfitting and underfitting in ML. How can you detect and prevent them?

Overfitting occurs when a model learns the training data too well, including the noise and random fluctuations, to the point where it performs poorly on new, unseen data. An overfit model has high variance and low bias. It is detected when the model shows very high accuracy on the training set but significantly lower accuracy on the validation or test set.

Underfitting occurs when a model is too simple to capture the underlying structure or pattern in the training data, resulting in poor performance on both the training and test sets. An underfit model has high bias and low variance. It is detected when both the training and test accuracy scores are low.

To prevent and detect these issues, one key technique is to manage the bias-variance tradeoff. Overfitting can be prevented by regularization techniques (like L1 or L2 regularization) which penalize complexity, by gathering more training data, or by simplifying the model. Underfitting is prevented by using a more complex model, introducing more relevant features, or reducing regularization. Cross-validation (e.g., K-Fold) is a primary method for detection, as it allows for testing the model's performance on multiple subsets of the data, providing a more robust estimate of its generalization error.
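
The train-versus-validation comparison described above can be sketched with NumPy alone. This is a minimal illustration on synthetic data, not a full K-Fold routine: the sine-shaped target, the noise level, the 25/15 split, and the polynomial degrees (1, 3, 9) are all assumptions chosen for the demonstration.

```python
import numpy as np

def train_val_mse(degree, x_train, y_train, x_val, y_val):
    """Fit a polynomial of the given degree; return (train MSE, validation MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    val_mse = float(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))
    return train_mse, val_mse

# Synthetic data (assumption): a noisy sine wave sampled at 40 random points.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 40)

# Hold out the last 15 points as a validation set.
x_train, y_train = x[:25], y[:25]
x_val, y_val = x[25:], y[25:]

# Degree 1 is a deliberately simple model, degree 9 a deliberately flexible one.
scores = {d: train_val_mse(d, x_train, y_train, x_val, y_val) for d in (1, 3, 9)}
for degree, (train_mse, val_mse) in scores.items():
    print(f"degree={degree}: train MSE={train_mse:.3f}, validation MSE={val_mse:.3f}")
```

Training error can only fall as the degree grows, so a low training error by itself says little. A validation error well above the training error suggests overfitting, while a high error on both sets suggests underfitting.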

Question 3: How would you handle missing values in a dataset? Explain at least three methods with examples.

Handling missing values is a critical step in data preprocessing, and the chosen method depends on the extent and nature of the missingness.


Deletion: This involves either Listwise Deletion (removing the entire row/sample that contains any missing value) or Pairwise Deletion (using only the available data for a specific analysis). For example, if a dataset has 5% of rows with missing 'Age' values, a data scientist might remove these 5% of rows (Listwise Deletion). This method is straightforward and used when the missing data is minimal and assumed to be Missing Completely at Random (MCAR). However, it risks losing valuable data and introducing bias if the missingness is not random.


Imputation with Mean/Median: This technique involves replacing the missing value with the mean (for symmetrically distributed numerical data) or the median (for skewed numerical data) of the existing non-missing values in that feature. For instance, in a dataset of house prices where some 'Square Footage' entries are missing, one would calculate the median square footage of all known houses and use that value to fill the gaps. While simple, this method shrinks the feature's variance and ignores its relationships with other features, which can lead to a less accurate model.


Predictive Modeling (e.g., Regression Imputation): This advanced technique treats the feature with missing values as the target variable and uses other features in the dataset as predictors to estimate the missing values. For example, an analyst could use a Linear Regression model to predict a missing 'Salary' based on the 'Years of Experience' and 'Job Title' features, effectively using a sub-model to fill in the missing data. This method preserves the relationships between variables better than simple mean/median imputation but is more complex and time-consuming to implement.
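
The three methods above can be compared side by side on a toy table. This is a minimal sketch: the six-row DataFrame and its column names are made up for illustration, and a straight-line fit from `np.polyfit` stands in for a fuller regression model.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): salary rises linearly with experience; two salaries are missing.
df = pd.DataFrame({
    "YearsExperience": [1, 3, 5, 7, 9, 11],
    "Salary": [30000.0, 40000.0, np.nan, 60000.0, np.nan, 80000.0],
})

# 1. Listwise deletion: drop every row with a missing salary.
dropped = df.dropna(subset=["Salary"])

# 2. Median imputation: fill every gap with the same value.
median_filled = df.assign(Salary=df["Salary"].fillna(df["Salary"].median()))

# 3. Regression imputation: fit Salary ~ YearsExperience on the complete rows,
#    then predict only where Salary is missing.
known = df.dropna(subset=["Salary"])
slope, intercept = np.polyfit(known["YearsExperience"], known["Salary"], 1)
predicted = slope * df["YearsExperience"] + intercept
regression_filled = df.assign(Salary=df["Salary"].fillna(predicted))

print("Rows after deletion:", len(dropped))
print(median_filled)
print(regression_filled)
```

Median imputation assigns the same salary to both gaps, whereas the regression fill follows the trend in experience, which is the relationship-preserving behavior described above.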

Question 4: What is an imbalanced dataset? Describe two techniques to handle it
(theoretical + practical).

An imbalanced dataset is one where the distribution of the target variable's classes is not approximately equal, meaning one class (the majority class) has significantly more observations than the other class(es) (the minority class). This imbalance can cause a machine learning model to be biased towards the majority class, resulting in poor performance (e.g., high accuracy but low recall for the minority class).

Two techniques to handle imbalanced datasets are:

Synthetic Minority Over-sampling Technique (SMOTE):


Theoretical: SMOTE is an oversampling approach where synthetic samples are created for the minority class. It works by selecting a minority class sample and then choosing one or more of its k-nearest neighbors. A new synthetic instance is then created along the line segments joining the sample and its neighbors, rather than simply duplicating existing minority samples.

Practical: Using the Python library imblearn, a data scientist would apply SMOTE() to the training data only (through its fit_resample method), leaving the validation and test sets untouched. This balances the class distribution, helping the model learn the characteristics of the minority class more effectively.
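
The interpolation step at the heart of SMOTE can be sketched in plain NumPy. This is a simplified illustration, not imblearn's implementation: the function name `smote_like_samples`, the five toy minority points, and the choice of k=3 are all assumptions, and it expects more than k minority samples.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples
    and one of their k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = minority[rng.integers(len(minority))]
        # Euclidean distance from the chosen sample to every minority sample.
        dists = np.linalg.norm(minority - base, axis=1)
        # Position 0 after sorting is the sample itself (distance 0), so skip it.
        neighbor_ids = np.argsort(dists)[1:k + 1]
        neighbor = minority[rng.choice(neighbor_ids)]
        # Pick a random point on the segment between the sample and its neighbor.
        gap = rng.random()
        synthetic[i] = base + gap * (neighbor - base)
    return synthetic

# Toy minority class (assumption): five 2-D points.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7]])
new_points = smote_like_samples(minority, n_new=5)
print(new_points)
```

Because each synthetic point lies on a segment between two existing minority samples, the new points stay inside the region the minority class already occupies instead of repeating identical rows.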

Using Class Weights in Models:


Theoretical: This technique does not alter the dataset itself but modifies the learning algorithm. By assigning higher weight to the minority class samples and lower weight to the majority class samples, the model is penalized more heavily for incorrect classifications of the minority class during training. This forces the model to pay more attention to the under-represented examples.

Practical: Many ML algorithms, such as Logistic Regression and Support Vector Machines (SVMs) in scikit-learn, have a class_weight parameter that can be set to 'balanced' or manually specified. Setting it to 'balanced' automatically computes weights inversely proportional to class frequencies, directly addressing the imbalance during the training process.
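
To make the 'balanced' setting concrete, the weights can be reproduced by hand. scikit-learn documents the heuristic as n_samples / (n_classes * np.bincount(y)); the sketch below applies that formula with NumPy only, and the 90/10 label split is an assumed example.

```python
import numpy as np

# Assumed example labels: 90 majority-class (0) and 10 minority-class (1) samples.
y = np.array([0] * 90 + [1] * 10)

class_counts = np.bincount(y)          # number of samples per class
n_classes = len(class_counts)

# 'balanced' heuristic: n_samples / (n_classes * class_count)
balanced_weights = len(y) / (n_classes * class_counts)

for cls, weight in enumerate(balanced_weights):
    print(f"class {cls}: count={class_counts[cls]}, weight={float(weight):.3f}")
```

With these weights, each class contributes the same total weight (50 in this example), so misclassifying a minority sample costs nine times as much as misclassifying a majority sample.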

Question 5: Why is feature scaling important in ML?
Compare Min-Max scaling and Standardization.

Feature scaling is an essential preprocessing step in machine learning that involves transforming the range of independent features to a common scale. It is crucial because many ML algorithms rely on calculating the distance between data points or depend on the magnitude of feature values. Without scaling, features with larger initial magnitudes (e.g., 'Salary' in the tens of thousands) would dominate the distance calculation over features with smaller magnitudes (e.g., 'Years of Experience' from 1 to 20), leading to a model that is heavily biased towards the larger-scale features. This impact is particularly significant for distance-based algorithms like K-Nearest Neighbors (KNN) and Support Vector Machines (SVM). Furthermore, for algorithms that rely on iterative optimization like Gradient Descent (used in Linear Regression and Neural Networks), unscaled features can cause the optimization path to oscillate wildly, resulting in slower convergence or failure to find the optimal solution.

Comparison of Min-Max Scaling and Standardization
Min-Max Scaling (Normalization) and Standardization (Z-Score Normalization) are two common techniques used in machine learning to scale features.

Min-Max Scaling (Normalization)
Min-Max Scaling transforms the data such that all feature values fall within a specific fixed range, typically [0, 1]. The formula used for this transformation is $X_{new} = \frac{X - X_{min}}{X_{max} - X_{min}}$. This technique is generally preferred when the data distribution is not Gaussian or when clear boundaries, such as minimum and maximum values, are required, as is often the case in image processing. A significant drawback, however, is that Min-Max Scaling is highly sensitive to outliers. Since the transformation is based on the maximum and minimum values, outliers can drastically compress the range of the majority of the data points, reducing the effective distinction between non-outlier values.

Standardization (Z-Score Normalization)
Standardization rescales the data so that the resulting distribution has a mean ($\mu$) of 0 and a standard deviation ($\sigma$) of 1. The formula for Standardization is $X_{new} = \frac{X - \mu}{\sigma}$. This method is preferred when the feature distribution is approximately Gaussian (normal) or when the model assumes normally distributed inputs, such as in Linear Discriminant Analysis. Standardization is less affected by outliers than Min-Max Scaling because it does not squeeze the data into a fixed, predefined range, although the mean and standard deviation are themselves still influenced by extreme values.
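
Both formulas can be applied directly with NumPy to see how they behave around an outlier. This is a minimal sketch: the five salary values, including the 500,000 outlier, are invented for illustration, and the helper names are hypothetical.

```python
import numpy as np

def min_max_scale(x):
    """X_new = (X - X_min) / (X_max - X_min), mapping values into [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """X_new = (X - mu) / sigma, using the population standard deviation."""
    return (x - x.mean()) / x.std()

# Assumed example: four typical salaries and one extreme outlier.
salaries = np.array([30_000.0, 35_000.0, 40_000.0, 45_000.0, 500_000.0])

scaled = min_max_scale(salaries)
z_scores = standardize(salaries)

print("Min-Max:     ", np.round(scaled, 3))
print("Standardized:", np.round(z_scores, 3))
```

In the Min-Max output the four typical salaries are squeezed into a narrow band near 0 because the outlier defines the top of the range. The standardized values are not bounded, but the outlier still inflates the mean and standard deviation used to compute them.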

Question 6: Compare Label Encoding and One-Hot Encoding.
When would you prefer one over the other?

Label Encoding and One-Hot Encoding are methods used to convert categorical variables into a numerical format that machine learning algorithms can process.


Label Encoding assigns a unique integer to each category. For a feature like 'Color' with categories 'Red', 'Blue', and 'Green', it might be encoded as 0, 1, and 2, respectively.


One-Hot Encoding creates a new binary feature (column) for each unique category in the original feature. In the 'Color' example, it would create three new columns: 'Color_Red', 'Color_Blue', and 'Color_Green'. For a data point that is 'Red', the 'Color_Red' column would be 1, and the others would be 0.

Label Encoding Preference
Label Encoding is preferred when dealing with Ordinal Variables. An ordinal variable is one where the categories have an inherent order or ranking (an ordinal relationship). A good example is 'Education Level,' where categories naturally progress (e.g., High School < Bachelor's < Master's). By assigning sequential integers (e.g., 0, 1, 2) to these categories, Label Encoding naturally captures this relationship, making the encoded variable useful for the machine learning model.

One-Hot Encoding Preference
One-Hot Encoding is preferred for Nominal Variables. A nominal variable is a categorical variable that has no intrinsic order (a nominal relationship), such as 'City' or 'Gender'. In this case, using Label Encoding would mistakenly introduce a false sense of ordinality (e.g., implying 'City 2' is somehow 'greater' than 'City 1'), which can confuse and mislead the model. One-Hot Encoding avoids this problem by creating separate binary columns for each category, ensuring that all categories are treated as independent entities.
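
The two preferences can be illustrated on a small table with pandas. This is a sketch under stated assumptions: the four-row DataFrame is made up, and the ordinal feature is encoded with an explicit mapping so the integers follow the real ranking (scikit-learn's LabelEncoder would instead number the categories in sorted order).

```python
import pandas as pd

# Toy data (assumption): one ordinal feature and one nominal feature.
df = pd.DataFrame({
    "Education": ["High School", "Master's", "Bachelor's", "High School"],
    "Color": ["Red", "Blue", "Green", "Red"],
})

# Ordinal feature: integers that respect High School < Bachelor's < Master's.
education_order = {"High School": 0, "Bachelor's": 1, "Master's": 2}
df["Education_Encoded"] = df["Education"].map(education_order)

# Nominal feature: one binary column per category, with no implied order.
encoded = pd.get_dummies(df, columns=["Color"], dtype=int)

print(encoded)
```

After encoding, 'Education_Encoded' can be compared numerically in a meaningful way, while each 'Color_*' column only signals whether a row belongs to that category.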

a). Analyze the relationship between app categories and ratings. Which categories have the highest/lowest average ratings, and what could be the possible reasons?

Analytical Steps
Here is the Python code structure and logic you need to execute the analysis:

In [1]:
import pandas as pd

# Initialize df to None
df = None

# 1. Data Loading (Assuming you have downloaded the CSV file, e.g., 'googleplaystore.csv')
# You must first download the dataset from the provided GitHub link.
try:
    df = pd.read_csv('googleplaystore.csv')
except FileNotFoundError:
    print("Error: The CSV file was not found. Please download it from the specified GitHub repository.")
    # The program will not exit here, but will proceed with df = None
    # Subsequent code will only run if df was successfully loaded.

# Only proceed with data cleaning and analysis if the DataFrame was loaded successfully
if df is not None:
    # 2. Data Cleaning and Preparation
    # The 'Rating' column is critical and contains missing values and potentially non-numeric entries.

    # Coerce 'Rating' to numeric first so that non-numeric entries become NaN,
    # then drop missing ratings once (imputing them could bias category means).
    df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
    df = df.dropna(subset=['Rating'])

    # Play Store ratings range from 1 to 5; discard out-of-range values,
    # which typically come from malformed rows (assumption about this CSV).
    df = df[df['Rating'].between(1, 5)]

    # 3. Analyze Relationship between Category and Average Rating
    # Group the data by 'Category' and calculate the mean of the 'Rating'.
    category_ratings = df.groupby('Category')['Rating'].mean().sort_values(ascending=False)

    # 4. Identify Highest and Lowest Rated Categories
    top_5_categories = category_ratings.head(5)
    bottom_5_categories = category_ratings.tail(5)

    # 5. Output the Results
    print("## Categories with the Highest Average Ratings")
    print(top_5_categories.to_string())

    print("\n## Categories with the Lowest Average Ratings")
    print(bottom_5_categories.to_string())
else:
    print("Data loading failed. Please ensure 'googleplaystore.csv' is available.")

Error: The CSV file was not found. Please download it from the specified GitHub repository.
Data loading failed. Please ensure 'googleplaystore.csv' is available.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Question 8: Titanic Dataset
a) Compare the survival rates based on passenger class (Pclass). Which class had the highest survival rate, and why do you think that happened?

The analysis of the Titanic dataset is a classic task in exploratory data analysis (EDA), revealing key demographic and social factors influencing survival. The code below outlines the analytical steps (Python code and logic) needed to compute the survival rates by passenger class and by age group.

🚢 Analytical Steps (Python Code Outline)
Here is the Python code structure and logic required to answer Question 8.


In [3]:
import pandas as pd
import numpy as np

# Initialize df to None
df = None

# 1. Data Loading from Google Drive
try:
    df = pd.read_csv('/content/drive/MyDrive/titanic.csv')
except FileNotFoundError:
    print("Error: The 'titanic.csv' file was not found in your Google Drive. Please ensure it's there and the path is correct.")
except Exception as e:
    print(f"An unexpected error occurred while loading the data: {e}")

# Only proceed with data cleaning and analysis if the DataFrame was loaded successfully
if df is not None:
    # 2. Data Cleaning and Preparation

    # For Pclass analysis (Part a), Pclass and Survived are generally clean, but always check for missing values.
    df.dropna(subset=['Pclass', 'Survived'], inplace=True)

    # For Age analysis (Part b), handle missing 'Age' values. Imputation (like median) is common,
    # but for simple survival rate comparison, dropping missing ages is often acceptable to keep groups pure.
    age_df = df.dropna(subset=['Age', 'Survived']).copy()

    ## Part a: Survival Rate by Pclass
    print("--- Part a: Survival Rate by Passenger Class (Pclass) ---")

    # Group by Pclass and calculate the mean of the 'Survived' column (1=Survived, 0=Died)
    # This mean represents the survival rate.
    pclass_survival = df.groupby('Pclass')['Survived'].mean().sort_values(ascending=False)
    pclass_survival = (pclass_survival * 100).round(2) # Convert to percentage

    print("\nPassenger Class Survival Rates:")
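    # Format every row with a ' %' suffix (concatenating to to_string() would tag only the last row).
    print(pclass_survival.map('{:.2f} %'.format).to_string())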

    print("\n" + "="*50 + "\n")

    ## Part b: Survival Rate by Age Group
    print("--- Part b: Survival Rate by Age Group (Child vs. Adult) ---")

    # Create the Age Group feature: Child (Age < 18) and Adult (Age >= 18)
    age_df['Age_Group'] = np.where(age_df['Age'] < 18, 'Child', 'Adult')

    # Group by the new 'Age_Group' and calculate the mean survival rate
    age_survival = age_df.groupby('Age_Group')['Survived'].mean().sort_values(ascending=False)
    age_survival = (age_survival * 100).round(2) # Convert to percentage

    print("\nAge Group Survival Rates:")
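    # Format every row with a ' %' suffix (concatenating to to_string() would tag only the last row).
    print(age_survival.map('{:.2f} %'.format).to_string())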
else:
    print("Data loading failed. Please ensure 'titanic.csv' is available and the path is correct.")

Error: The 'titanic.csv' file was not found in your Google Drive. Please ensure it's there and the path is correct.
Data loading failed. Please ensure 'titanic.csv' is available and the path is correct.


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


After executing the above cell, you'll be prompted to authorize Colab to access your Google Drive. Follow the instructions to complete the mounting process. Once mounted, you can place your `titanic.csv` file into your Google Drive (e.g., in 'My Drive') and then access it from Colab. For example, if it's directly in 'My Drive', you would load it like this:

```python
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/titanic.csv')
```

If you place it in a subfolder like `Colab Notebooks`, it would be:

```python
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/titanic.csv')
```

Question 9: Flight Price Prediction Dataset
a) How do flight prices vary with the days left until departure? Identify any exponential price
surges and recommend the best booking window.
b) Compare prices across airlines for the same route (e.g., Delhi-Mumbai). Which airlines are
consistently cheaper/premium, and why?



In [3]:
import pandas as pd
import numpy as np

# Initialize df to None
df = None

# 1. Data Loading and Cleaning (Assuming the dataset is downloaded)
try:
    df = pd.read_csv('Flight_Price_Prediction.csv')
except FileNotFoundError:
    print("Error: The CSV file was not found. Please download it from the specified GitHub repository.")
    # Do not exit, allow subsequent code to handle df being None
except Exception as e:
    print(f"An unexpected error occurred while loading the data: {e}")

# Only proceed with data cleaning and analysis if the DataFrame was loaded successfully
if df is not None:
    # Convert date columns to datetime objects
    df['Date_of_Journey'] = pd.to_datetime(df['Date_of_Journey'], format='%d/%m/%Y', errors='coerce')
    df['Date_of_Booking'] = pd.to_datetime(df['Date_of_Booking'], format='%d/%m/%Y', errors='coerce')
    df.dropna(subset=['Date_of_Journey', 'Date_of_Booking', 'Price'], inplace=True)


    ## Part a: Price Variation vs. Days Left Until Departure
    print("--- Part a: Price Variation vs. Days Left Until Departure ---")

    # Calculate Days Left Until Departure
    df['Days_Left'] = (df['Date_of_Journey'] - df['Date_of_Booking']).dt.days

    # Group by Days_Left and calculate the median price (median is robust to outliers)
    price_vs_days_left = df.groupby('Days_Left')['Price'].median().reset_index()

    # Sort by days left and print the results for observation
    print("\nMedian Price by Days Left (Top 5 and Bottom 5):")
    print(price_vs_days_left.sort_values(by='Days_Left', ascending=False).head(5).to_string())
    print(price_vs_days_left.sort_values(by='Days_Left', ascending=True).head(5).to_string())

    # Typically, you would visualize this using:
    # import seaborn as sns
    # sns.lineplot(data=price_vs_days_left, x='Days_Left', y='Price')


    ## Part b: Price Comparison Across Airlines for a Specific Route (Delhi-Mumbai)
    print("\n--- Part b: Price Comparison Across Airlines for Delhi-Mumbai Route ---")

    # Filter the dataset for the specific route
    delhi_mumbai_df = df[(df['Source'] == 'Delhi') & (df['Destination'] == 'Mumbai')].copy()

    # Group by 'Airline' and calculate the mean price
    airline_price_comparison = delhi_mumbai_df.groupby('Airline')['Price'].mean().sort_values()

    print("\nAverage Price Comparison for Delhi-Mumbai Route:")
    print(airline_price_comparison.to_string())

    # Typically, you would visualize this using:
    # sns.barplot(x=airline_price_comparison.index, y=airline_price_comparison.values)
else:
    print("Data loading failed. Please ensure 'Flight_Price_Prediction.csv' is available.")

Error: The CSV file was not found. Please download it from the specified GitHub repository.
Data loading failed. Please ensure 'Flight_Price_Prediction.csv' is available.


Question 10: HR Analytics Dataset
a). What factors most strongly correlate with employee attrition? Use visualizations to show key
drivers (e.g., satisfaction, overtime, salary).
b). Are employees with more projects more likely to leave?



In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Data Loading (You must download the hr_analytics dataset first)
try:
    df = pd.read_csv('hr_analytics.csv')
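except FileNotFoundError:
    # Stop here because the rest of the analysis needs df.
    # Raising SystemExit behaves the same in scripts and notebooks, unlike exit().
    raise SystemExit("Error: The CSV file was not found. Please ensure it is in your working directory.")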

# Ensure 'Attrition' is numeric for correlation (if it's 'Yes'/'No', convert it)
# Assuming 'Attrition' is 1 for Yes (Left) and 0 for No (Stayed)
df['Attrition_Numeric'] = df['Attrition'].apply(lambda x: 1 if x == 'Yes' else 0)

## Part a: Correlation Analysis (Strongest Drivers)
print("--- Part a: Factors Correlating with Employee Attrition ---")

# Convert high-impact categorical features like 'Overtime' and 'Salary' into numerical format
# Overtime: Convert 'Yes' to 1 and 'No' to 0 for correlation
df['Overtime_Numeric'] = df['Overtime'].apply(lambda x: 1 if x == 'Yes' else 0)

# Salary: If salary is tiered (e.g., 'Low', 'Medium', 'High'), use Label Encoding (e.g., 1, 2, 3)
# (Adjust this section based on the actual 'Salary' column name and values in your dataset)
# Example: df['Salary_Numeric'] = df['Salary'].map({'Low': 1, 'Medium': 2, 'High': 3})

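# Calculate the correlation matrix for key features.
# 'Salary_Numeric' exists only if the mapping above is enabled, so keep just the columns present in df.
candidate_features = ['Attrition_Numeric', 'Satisfaction', 'Overtime_Numeric', 'Salary_Numeric', 'Number_of_Projects']
key_features = [col for col in candidate_features if col in df.columns]
correlation_matrix = df[key_features].corr()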

# Extract correlation of Attrition with other factors
attrition_corr = correlation_matrix['Attrition_Numeric'].sort_values(ascending=False).drop('Attrition_Numeric')

print("\nCorrelation of Key Factors with Attrition:")
print(attrition_corr.to_string())

# Visualization for Categorical Driver (Overtime)
plt.figure(figsize=(6, 4))
sns.barplot(x='Overtime', y='Attrition_Numeric', data=df)
plt.title('Attrition Rate by Overtime Status')
plt.ylabel('Attrition Rate (Mean)')
plt.show()

## Part b: Projects vs. Attrition
print("\n--- Part b: Projects vs. Attrition ---")

# Calculate the attrition rate for each number of projects
projects_attrition = df.groupby('Number_of_Projects')['Attrition_Numeric'].mean().reset_index()

# Visualization
plt.figure(figsize=(8, 5))
sns.barplot(x='Number_of_Projects', y='Attrition_Numeric', data=projects_attrition)
plt.title('Attrition Rate by Number of Projects')
plt.ylabel('Attrition Rate (Mean)')
plt.xlabel('Number of Projects')
plt.show()

print("\nAttrition Rate by Number of Projects:")
print(projects_attrition.to_string())