

---

# **EDA – 1: Bike Details Dataset**



### **1. What is the range of selling prices in the dataset?**

```python
selling_price_range = df['selling_price'].agg(['min', 'max'])
print("Selling Price Range:")
print(selling_price_range)
```

---

### **2. What is the median selling price for bikes in the dataset?**

```python
median_price = df['selling_price'].median()
print("Median Selling Price:", median_price)
```

---

### **3. What is the most common seller type?**

```python
most_common_seller = df['seller_type'].mode()[0]
print("Most Common Seller Type:", most_common_seller)
```

---

### **4. How many bikes have driven more than 50,000 kilometers?**

```python
count_high_kms = df[df['km_driven'] > 50000].shape[0]
print("Bikes Driven Over 50,000 KM:", count_high_kms)
```

---

### **5. What is the average km_driven value for each ownership type?**

```python
avg_kms_per_owner = df.groupby('owner')['km_driven'].mean()
print("Average KM Driven per Ownership Type:")
print(avg_kms_per_owner)
```

---

### **6. What proportion of bikes are from the year 2015 or older?**

```python
# The mean of a boolean mask is the share of rows where the condition holds
prop_2015_older = (df['year'] <= 2015).mean()
print("Proportion of bikes from 2015 or older:", prop_2015_older)
```

---

### **7. What is the trend of missing values across the dataset?**

```python
import seaborn as sns
import matplotlib.pyplot as plt

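missing = df.isnull().sum()                 # absolute count of missing values per column
missing_percent = df.isnull().mean() * 100  # share of rows missing per column, in percent
print("Missing Values per Column:")
print(missing)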

plt.figure(figsize=(10, 5))
sns.barplot(x=missing_percent.index, y=missing_percent.values)
plt.xticks(rotation=45)
plt.ylabel('Percentage of Missing Values')
plt.title('Missing Value Trend in Dataset')
plt.show()
```

---

### **8. What is the highest ex_showroom_price recorded, and for which bike?**

```python
max_price_row = df.loc[df['ex_showroom_price'].idxmax()]
print("Bike with Highest Ex-Showroom Price:")
print(max_price_row[['name', 'ex_showroom_price']])
```

---

### **9. What is the total number of bikes listed by each seller type?**

```python
seller_counts = df['seller_type'].value_counts()
print("Total Bikes by Seller Type:")
print(seller_counts)
```

---

### **10. What is the relationship between selling_price and km_driven for first-owner bikes?**

```python
import seaborn as sns
import matplotlib.pyplot as plt

first_owner = df[df['owner'] == '1st owner']
sns.scatterplot(x='km_driven', y='selling_price', data=first_owner)
plt.title('Selling Price vs. KM Driven (1st Owner Bikes)')
plt.xlabel('KM Driven')
plt.ylabel('Selling Price')
plt.show()
```

---

### **11. Identify and remove outliers in the km_driven column using the IQR method**

```python
Q1 = df['km_driven'].quantile(0.25)
Q3 = df['km_driven'].quantile(0.75)
IQR = Q3 - Q1

# Filter out outliers
df_no_outliers = df[(df['km_driven'] >= Q1 - 1.5 * IQR) & (df['km_driven'] <= Q3 + 1.5 * IQR)]
print("Shape before:", df.shape)
print("Shape after removing outliers:", df_no_outliers.shape)
```

---

### **12. Perform a bivariate analysis to visualize the relationship between year and selling_price**

```python
sns.boxplot(x='year', y='selling_price', data=df)
plt.xticks(rotation=90)
plt.title('Year vs. Selling Price')
plt.xlabel('Year')
plt.ylabel('Selling Price')
plt.show()
```

---

### **13. What is the average depreciation in selling price based on the bike's age (current year - manufacturing year)?**

```python
from datetime import datetime

current_year = datetime.now().year
df['age'] = current_year - df['year']
df['depreciation'] = df['ex_showroom_price'] - df['selling_price']
avg_depreciation_by_age = df.groupby('age')['depreciation'].mean()
print("Average Depreciation by Age:")
print(avg_depreciation_by_age)
```

---

### **14. Which bike names are priced significantly above the average price for their manufacturing year?**

```python
avg_price_by_year = df.groupby('year')['selling_price'].mean().reset_index()
merged = df.merge(avg_price_by_year, on='year', suffixes=('', '_year_avg'))

# Threshold: 50% higher than average for the year
significantly_priced = merged[merged['selling_price'] > 1.5 * merged['selling_price_year_avg']]
print("Bikes Priced Significantly Above Yearly Average:")
print(significantly_priced[['name', 'year', 'selling_price', 'selling_price_year_avg']])
```

---

### **15. Develop a correlation matrix for numeric columns and visualize it using a heatmap**

```python
import seaborn as sns
import matplotlib.pyplot as plt

# 'number' selects every numeric dtype, not only float64 and int64
numeric_cols = df.select_dtypes(include='number')
correlation_matrix = numeric_cols.corr()

plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap of Numeric Features")
plt.show()
```

---


# **EDA – 2: Car Sales Dataset**



---

### **1. What is the average selling price of cars for each dealer, and how does it compare across different dealers?**

```python
avg_price_per_dealer = df.groupby('Dealer_Name')['Price ($)'].mean()
print("Average Selling Price for Each Dealer:")
print(avg_price_per_dealer)
```

---

### **2. Which car brand (Company) has the highest variation in prices, and what does this tell us about the pricing trends?**

```python
price_variation_per_brand = df.groupby('Company')['Price ($)'].std()
max_variation_brand = price_variation_per_brand.idxmax()
max_variation_value = price_variation_per_brand.max()

print("Car Brand with the Highest Price Variation:")
print(f"Brand: {max_variation_brand}, Price Variation: {max_variation_value}")
```

---

### **3. What is the distribution of car prices for each transmission type, and how do the interquartile ranges compare?**

```python
sns.boxplot(x='Transmission', y='Price ($)', data=df)
plt.title('Car Prices by Transmission Type')
plt.xlabel('Transmission Type')
plt.ylabel('Price ($)')
plt.show()
```
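The box plot shows the interquartile ranges only visually. To compare them as numbers, one option is to compute the quartiles per group. The sketch below uses a small made-up frame (`toy`) in place of the real data, so the figures are illustrative only; `iqr_by_group` is a hypothetical helper name, and the column names are the ones used above.

```python
import pandas as pd

# Hypothetical toy data standing in for the car sales frame
toy = pd.DataFrame({
    'Transmission': ['Auto'] * 4 + ['Manual'] * 4,
    'Price ($)': [10000, 20000, 30000, 40000, 15000, 16000, 17000, 18000],
})

def iqr_by_group(frame, group_col, value_col):
    """Return Q1, Q3 and IQR of value_col for each level of group_col."""
    grouped = frame.groupby(group_col)[value_col]
    summary = pd.DataFrame({
        'Q1': grouped.quantile(0.25),
        'Q3': grouped.quantile(0.75),
    })
    summary['IQR'] = summary['Q3'] - summary['Q1']
    return summary

iqr_summary = iqr_by_group(toy, 'Transmission', 'Price ($)')
print(iqr_summary)
```

On the real data the call would be `iqr_by_group(df, 'Transmission', 'Price ($)')`; a larger IQR means the middle half of prices for that transmission type is more spread out.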

---

### **4. What is the distribution of car prices across different regions?**

```python
sns.boxplot(x='Dealer_Region', y='Price ($)', data=df)
plt.title('Car Prices Across Different Regions')
plt.xlabel('Dealer Region')
plt.ylabel('Price ($)')
plt.show()
```

---

### **5. What is the distribution of cars based on body styles?**

```python
body_style_counts = df['Body Style'].value_counts()
sns.barplot(x=body_style_counts.index, y=body_style_counts.values)
plt.title('Distribution of Cars Based on Body Style')
plt.xlabel('Body Style')
plt.ylabel('Number of Cars')
plt.show()
```

---

### **6. How does the average selling price of cars vary by customer gender and annual income?**

```python
avg_price_by_gender_income = df.groupby(['Gender', 'Annual Income'])['Price ($)'].mean().unstack()
print("Average Car Price by Gender and Annual Income:")
print(avg_price_by_gender_income)
```

---

### **7. What is the distribution of car prices by region, and how does the number of cars sold vary by region?**

```python
sns.boxplot(x='Dealer_Region', y='Price ($)', data=df)
plt.title('Car Prices by Region')
plt.xlabel('Dealer Region')
plt.ylabel('Price ($)')
plt.show()

# Number of cars sold by region
region_sales_count = df['Dealer_Region'].value_counts()
print("Number of Cars Sold by Region:")
print(region_sales_count)
```

---

### **8. How does the average car price differ between cars with different engine sizes?**

```python
avg_price_by_engine = df.groupby('Engine')['Price ($)'].mean()
print("Average Car Price by Engine Size:")
print(avg_price_by_engine)
```

---

### **9. How do car prices vary based on the customer’s annual income bracket?**

```python
# Create income brackets; the last bin is open-ended so incomes above 150k are not turned into NaN
income_bins = [0, 30000, 60000, 90000, 120000, float('inf')]
income_labels = ["<30k", "30k-60k", "60k-90k", "90k-120k", "120k+"]
df['Income Bracket'] = pd.cut(df['Annual Income'], bins=income_bins, labels=income_labels)

avg_price_by_income_bracket = df.groupby('Income Bracket')['Price ($)'].mean()
print("Average Car Price by Income Bracket:")
print(avg_price_by_income_bracket)
```

---

### **10. What are the top 5 car models with the highest number of sales, and how does their price distribution look?**

```python
top_5_models = df['Model'].value_counts().head(5)
print("Top 5 Car Models with the Highest Sales:")
print(top_5_models)

# Price distribution for top 5 models
top_5_models_data = df[df['Model'].isin(top_5_models.index)]
sns.boxplot(x='Model', y='Price ($)', data=top_5_models_data)
plt.title('Price Distribution for Top 5 Car Models')
plt.xlabel('Car Model')
plt.ylabel('Price ($)')
plt.xticks(rotation=45)
plt.show()
```

---

### **11. How does car price vary with engine size across different car colors, and which colors have the highest price variation?**

```python
sns.boxplot(x='Engine', y='Price ($)', hue='Color', data=df)
plt.title('Car Price Variation by Engine Size and Color')
plt.xlabel('Engine Type')
plt.ylabel('Price ($)')
plt.show()

# Price variation by color
price_variation_by_color = df.groupby('Color')['Price ($)'].std()
max_price_variation_color = price_variation_by_color.idxmax()
max_price_variation_value = price_variation_by_color.max()

print(f"Color with Highest Price Variation: {max_price_variation_color}, Variation: {max_price_variation_value}")
```

---

### **12. Is there any seasonal trend in car sales based on the date of sale?**

```python
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.month
monthly_sales = df.groupby('Month').size()

sns.lineplot(x=monthly_sales.index, y=monthly_sales.values)
plt.title('Car Sales Trend by Month')
plt.xlabel('Month')
plt.ylabel('Number of Sales')
plt.show()
```

---

### **13. How does the car price distribution change when considering different combinations of body style and transmission type?**

```python
sns.boxplot(x='Body Style', y='Price ($)', hue='Transmission', data=df)
plt.title('Car Price Distribution by Body Style and Transmission Type')
plt.xlabel('Body Style')
plt.ylabel('Price ($)')
plt.show()
```

---

### **14. What is the correlation between car price, engine size, and annual income of customers, and how do these features interact?**

```python
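corr_df = df[['Price ($)', 'Engine', 'Annual Income']].copy()
# corr() needs numeric input; if Engine holds text labels, encode them as integer codes.
# The codes are arbitrary, so any correlation involving Engine is only indicative.
if not pd.api.types.is_numeric_dtype(corr_df['Engine']):
    corr_df['Engine'] = corr_df['Engine'].astype('category').cat.codes
correlation_matrix = corr_df.corr()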
print("Correlation Matrix between Price, Engine Size, and Annual Income:")
print(correlation_matrix)

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap between Price, Engine Size, and Annual Income')
plt.show()
```

---

### **15. How does the average car price vary across different car models and engine types?**

```python
avg_price_by_model_engine = df.groupby(['Model', 'Engine'])['Price ($)'].mean().unstack()
print("Average Car Price by Model and Engine Type:")
print(avg_price_by_model_engine)
```



---

# **EDA – 3: Amazon Sales Data**



---
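The snippets in this section call `mean()`, `nlargest()` and `corr()` on `rating`, `rating_count`, `discounted_price`, `actual_price` and `discount_percentage`, so those columns need to be numeric. If the raw file stores them as text (for example with a currency symbol, thousands separators or a percent sign; this is an assumption about the export, not something stated in the analysis), a cleaning pass along these lines may be needed first. `to_number` is a hypothetical helper name.

```python
import pandas as pd

def to_number(series):
    """Keep only digits and the decimal point, then convert to float.

    Values that cannot be parsed become NaN. Note that this simple
    pattern also drops minus signs, which is fine for prices and counts.
    """
    cleaned = series.astype(str).str.replace(r'[^0-9.]', '', regex=True)
    return pd.to_numeric(cleaned, errors='coerce')

# Made-up sample values illustrating the kinds of strings handled
raw = pd.Series(['₹1,099', '43%', '4.2', '|'])
result = to_number(raw)
print(result.tolist())
```

Applied to the dataset, this would look like `df['rating'] = to_number(df['rating'])`, repeated for each affected column.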

### **1. What is the average rating for each product category?**

```python
avg_rating_by_category = df.groupby('category')['rating'].mean()
print("Average Rating by Product Category:")
print(avg_rating_by_category)
```

---

### **2. What are the top rating_count products by category?**

```python
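# Sort by rating_count, then keep the first (highest) row of each category
top_rating_count_by_category = df.sort_values('rating_count', ascending=False).drop_duplicates('category').set_index('category')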
print("Top Products by Rating Count in Each Category:")
print(top_rating_count_by_category[['product_name', 'rating_count']])
```

---

### **3. What is the distribution of discounted prices vs. actual prices?**

```python
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
sns.histplot(df['discounted_price'], color='blue', kde=True, label='Discounted Price', bins=30)
sns.histplot(df['actual_price'], color='red', kde=True, label='Actual Price', bins=30)
plt.legend()
plt.title('Distribution of Discounted Price vs Actual Price')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
```

---

### **4. How does the average discount percentage vary across categories?**

```python
avg_discount_by_category = df.groupby('category')['discount_percentage'].mean()
print("Average Discount Percentage by Category:")
print(avg_discount_by_category)
```

---

### **5. What are the most popular product names?**

```python
most_popular_products = df['product_name'].value_counts().head(10)
print("Most Popular Product Names:")
print(most_popular_products)
```

---

### **6. What are the most popular product keywords?**

```python
from collections import Counter
import re

# Extract keywords from product names only (review text is not included)
all_words = ' '.join(df['product_name'].dropna()).lower()
words = re.findall(r'\w+', all_words)
word_counts = Counter(words)

# Displaying top 10 popular keywords
top_keywords = word_counts.most_common(10)
print("Most Popular Product Keywords:")
print(top_keywords)
```

---

### **7. What are the most popular product reviews?**

```python
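# No per-review popularity score is assumed here, so rank reviews by the product's rating_count as a proxy
ranked = df.dropna(subset=['review_title', 'review_content']).sort_values('rating_count', ascending=False)
most_popular_reviews = ranked[['review_title', 'review_content']]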
print("Most Popular Product Reviews:")
print(most_popular_reviews.head(10))
```

---

### **8. What is the correlation between discounted_price and rating?**

```python
correlation = df[['discounted_price', 'rating']].corr()
print("Correlation between Discounted Price and Rating:")
print(correlation)

# Visualizing the correlation
sns.scatterplot(x='discounted_price', y='rating', data=df)
plt.title('Discounted Price vs Rating')
plt.xlabel('Discounted Price')
plt.ylabel('Rating')
plt.show()
```

---

### **9. What are the Top 5 categories based on the highest ratings?**

```python
top_5_categories_by_rating = df.groupby('category')['rating'].mean().nlargest(5)
print("Top 5 Categories by Average Rating:")
print(top_5_categories_by_rating)
```

---

### **10. Identify any potential areas for improvement or optimization based on the data analysis**

Here are some insights and possible areas for improvement:

- **Low-rated products with a high rating count:** Products with many ratings but a low average rating point to widespread customer dissatisfaction. Improving these products can help enhance the brand image.

- **Products with a high discount percentage but a low rating:** Heavy discounts paired with low ratings may lead customers to doubt product quality. Improving quality or adjusting the marketing could help.

- **Underperforming categories:** Categories with lower average ratings or lower sales volume might need better product curation or more targeted marketing.

- **Price distribution analysis:** A large gap between discounted price and actual price across many products could indicate an ineffective pricing strategy. Reviewing discount levels may help optimize sales.
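The first three checks above can be sketched as simple filters. The frame below is made-up toy data that reuses the column names from this section, and the thresholds (`LOW_RATING`, `HIGH_COUNT`, `HIGH_DISCOUNT`) are illustrative assumptions rather than values derived from the analysis.

```python
import pandas as pd

# Hypothetical toy data; column names follow the dataset used above
toy = pd.DataFrame({
    'product_name': ['A', 'B', 'C', 'D'],
    'category': ['X', 'X', 'Y', 'Y'],
    'rating': [3.2, 4.5, 4.1, 3.0],
    'rating_count': [5000, 200, 50, 10],
    'discount_percentage': [60, 10, 20, 70],
})

# Illustrative thresholds (assumptions, tune them to the real data)
LOW_RATING = 3.5
HIGH_COUNT = 1000
HIGH_DISCOUNT = 50

is_low_rated = toy['rating'] < LOW_RATING

# Many ratings but a low average rating
low_rated_popular = toy[is_low_rated & (toy['rating_count'] > HIGH_COUNT)]

# Heavily discounted yet poorly rated
discounted_low_rated = toy[is_low_rated & (toy['discount_percentage'] > HIGH_DISCOUNT)]

# Categories ordered from lowest to highest average rating
category_avg_rating = toy.groupby('category')['rating'].mean().sort_values()

print(low_rated_popular['product_name'].tolist())
print(discounted_low_rated['product_name'].tolist())
print(category_avg_rating)
```

Replacing `toy` with `df` runs the same checks on the full dataset, provided the rating, count and discount columns are numeric.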



---

# **EDA – 4: Spotify Data: Popular Hip-Hop Artists and Tracks**



---

### **1. Read the dataframe, check for null values and handle any that are present, then check for duplicate rows and remove any that are found**

```python
import pandas as pd

# Load dataset
df = pd.read_csv('spotify_hiphop_tracks.csv')

# Check for null values
null_values = df.isnull().sum()
print("Null Values in Each Column:")
print(null_values)

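# Fill or drop null values as needed (example: filling numeric columns with their column mean).
# numeric_only=True skips text columns such as artist and track names, which have no mean.
df.fillna(df.mean(numeric_only=True), inplace=True)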

# Check for duplicates
duplicate_rows = df.duplicated().sum()
print(f"Number of Duplicate Rows: {duplicate_rows}")

# Remove duplicate rows if any
df.drop_duplicates(inplace=True)
```

---

### **2. What is the distribution of popularity among the tracks in the dataset? Visualize it using a histogram**

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Plotting the distribution of popularity
plt.figure(figsize=(10, 6))
sns.histplot(df['Popularity'], kde=True, bins=20, color='purple')
plt.title('Distribution of Track Popularity')
plt.xlabel('Popularity')
plt.ylabel('Frequency')
plt.show()
```

---

### **3. Is there any relationship between the popularity and the duration of tracks? Explore this using a scatter plot**

```python
# Scatter plot between popularity and track duration
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Duration (ms)', y='Popularity', data=df, color='blue')
plt.title('Popularity vs Duration of Tracks')
plt.xlabel('Duration (ms)')
plt.ylabel('Popularity')
plt.show()
```

---

### **4. Which artist has the highest number of tracks in the dataset? Display the count of tracks for each artist using a countplot**

```python
# Count the number of tracks per artist
plt.figure(figsize=(12, 6))
sns.countplot(x='Artist', data=df, order=df['Artist'].value_counts().index, palette='viridis')
plt.title('Track Count per Artist')
plt.xlabel('Artist')
plt.ylabel('Number of Tracks')
plt.xticks(rotation=90)
plt.show()

# Top artist with the highest number of tracks
top_artist = df['Artist'].value_counts().idxmax()
print(f"Artist with the Highest Number of Tracks: {top_artist}")
```

---

### **5. What are the top 5 least popular tracks in the dataset? Provide the artist name and track name for each**

```python
# Get the top 5 least popular tracks
least_popular_tracks = df.nsmallest(5, 'Popularity')[['Artist', 'Track Name', 'Popularity']]
print("Top 5 Least Popular Tracks:")
print(least_popular_tracks)
```

---

### **6. Among the top 5 most popular artists, which artist has the highest popularity on average? Calculate and display the average popularity for each artist**

```python
# Top 5 most popular artists based on track popularity
top_5_artists = df.groupby('Artist')['Popularity'].mean().nlargest(5)
print("Average Popularity of the Top 5 Artists:")
print(top_5_artists)

# Find the artist with the highest average popularity
highest_avg_popularity_artist = top_5_artists.idxmax()
highest_avg_popularity = top_5_artists.max()

print(f"Artist with the Highest Average Popularity: {highest_avg_popularity_artist} (Avg Popularity: {highest_avg_popularity})")
```

---

### **7. For the top 5 most popular artists, what are their most popular tracks? List the track name for each artist**

```python
# Get the top 5 most popular artists
top_5_artists = df.groupby('Artist')['Popularity'].mean().nlargest(5).index

# Find the most popular tracks for each artist
most_popular_tracks = {}
for artist in top_5_artists:
    most_popular_tracks[artist] = df[df['Artist'] == artist].nlargest(1, 'Popularity')[['Track Name', 'Popularity']]

print("Most Popular Tracks for Top 5 Artists:")
for artist, track in most_popular_tracks.items():
    print(f"{artist}: {track['Track Name'].values[0]} - Popularity: {track['Popularity'].values[0]}")
```

---

### **8. Visualize relationships between multiple numerical variables simultaneously using a pair plot**

```python
# Select numerical columns for pair plot
numerical_columns = ['Popularity', 'Duration (ms)']

# Plot pair plot
sns.pairplot(df[numerical_columns], plot_kws={'alpha': 0.5})
plt.suptitle("Pair Plot of Popularity and Duration", y=1.02)
plt.show()
```

---

### **9. Does the duration of tracks vary significantly across different artists? Explore this visually using a box plot or violin plot**

```python
# Box plot to visualize the variation of track duration across different artists
plt.figure(figsize=(12, 6))
sns.boxplot(x='Artist', y='Duration (ms)', data=df, palette='coolwarm')
plt.title('Variation of Track Duration Across Artists')
plt.xlabel('Artist')
plt.ylabel('Duration (ms)')
plt.xticks(rotation=90)
plt.show()
```

---

### **10. How does the distribution of track popularity vary for different artists? Visualize this using a swarm plot or a violin plot**

```python
# Violin plot to visualize the distribution of track popularity across artists
plt.figure(figsize=(12, 6))
sns.violinplot(x='Artist', y='Popularity', data=df, palette='coolwarm')
plt.title('Distribution of Track Popularity Across Artists')
plt.xlabel('Artist')
plt.ylabel('Popularity')
plt.xticks(rotation=90)
plt.show()
```

---

