---
#-------------------->> Questions Answer <<--------------------
---

### **Exploratory Data Analysis (EDA) on Bike Details Dataset**  

I'll answer each question in detail with Python code where applicable.  

---

### **1. What is the range of selling prices in the dataset?**  

The range of selling prices is calculated as:  
\[
\text{Range} = \text{Maximum Selling Price} - \text{Minimum Selling Price}
\]

#### **Python Code:**
```python
import pandas as pd

# Load the dataset
df = pd.read_csv("bike_details.csv")  # Ensure correct filename

# Calculate range of selling prices
price_range = df['selling_price'].max() - df['selling_price'].min()

print(f"Range of selling prices: {price_range} INR")
```

---

### **2. What is the median selling price for bikes in the dataset?**  

The **median** selling price is the middle value once all prices are sorted; unlike the mean, it is not pulled up by a few very expensive listings.

#### **Python Code:**
```python
median_price = df['selling_price'].median()
print(f"Median selling price: {median_price} INR")
```

---

### **3. What is the most common seller type?**  

The `seller_type` column takes values such as "Individual" and "Dealer"; the mode gives the most frequent one.

#### **Python Code:**
```python
common_seller_type = df['seller_type'].mode()[0]
print(f"Most common seller type: {common_seller_type}")
```

---

### **4. How many bikes have driven more than 50,000 kilometers?**  

#### **Python Code:**
```python
bikes_above_50k = df[df['km_driven'] > 50000].shape[0]
print(f"Number of bikes driven more than 50,000 km: {bikes_above_50k}")
```

---

### **5. What is the average km_driven value for each ownership type?**  

#### **Python Code:**
```python
avg_km_by_owner = df.groupby('owner')['km_driven'].mean()
print(avg_km_by_owner)
```

---

### **6. What proportion of bikes are from the year 2015 or older?**  

#### **Python Code:**
```python
bikes_2015_or_older = df[df['year'] <= 2015].shape[0]
total_bikes = df.shape[0]
proportion = (bikes_2015_or_older / total_bikes) * 100

print(f"Proportion of bikes from 2015 or older: {proportion:.2f}%")
```

---

### **7. What is the trend of missing values across the dataset?**  

#### **Python Code:**
```python
missing_values = df.isnull().sum()
print("Missing values per column:\n", missing_values)
```
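If "trend" is read as how the gaps are spread across columns, the share of missing rows per column is often easier to compare than raw counts. A minimal sketch on a small made-up frame (the `demo` values are hypothetical; only the column names follow the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical sample; values are made up for illustration
demo = pd.DataFrame({
    "selling_price": [50000, 30000, 45000, 60000],
    "ex_showroom_price": [np.nan, 52000.0, np.nan, np.nan],
})

# isnull().mean() gives the fraction of missing rows per column
missing_pct = demo.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Columns with a high share of missing values may be better handled separately from columns that are nearly complete.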

---

### **8. What is the highest ex_showroom_price recorded, and for which bike?**  

#### **Python Code:**
```python
highest_ex_price = df.loc[df['ex_showroom_price'].idxmax()]
print(f"Bike with highest ex-showroom price: {highest_ex_price['name']}")
print(f"Highest ex-showroom price: {highest_ex_price['ex_showroom_price']} INR")
```

---

### **9. What is the total number of bikes listed by each seller type?**  

#### **Python Code:**
```python
seller_counts = df['seller_type'].value_counts()
print(seller_counts)
```

---

### **10. What is the relationship between selling_price and km_driven for first-owner bikes?**  

A scatter plot helps visualize this.

#### **Python Code:**
```python
import matplotlib.pyplot as plt
import seaborn as sns

first_owner_bikes = df[df['owner'] == '1st owner']

plt.figure(figsize=(8, 6))
sns.scatterplot(x=first_owner_bikes['km_driven'], y=first_owner_bikes['selling_price'])
plt.xlabel("Kilometers Driven")
plt.ylabel("Selling Price")
plt.title("Selling Price vs. Km Driven for First-Owner Bikes")
plt.show()
```

---

### **11. Identify and remove outliers in the km_driven column using the IQR method.**  

#### **Python Code:**
```python
Q1 = df['km_driven'].quantile(0.25)
Q3 = df['km_driven'].quantile(0.75)
IQR = Q3 - Q1

# Define lower and upper bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Removing outliers
df_no_outliers = df[(df['km_driven'] >= lower_bound) & (df['km_driven'] <= upper_bound)]
print(f"Number of records after outlier removal: {df_no_outliers.shape[0]}")
```
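To see how the bounds behave, here is a small worked example of the same IQR rule on five made-up readings (the `km` values are hypothetical):

```python
import pandas as pd

# Hypothetical readings; the last value is an obvious outlier
km = pd.Series([10000, 20000, 30000, 40000, 500000])

q1, q3 = km.quantile(0.25), km.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# between() is inclusive on both ends, matching the >= and <= filter above
kept = km[km.between(lower, upper)]
print(lower, upper, kept.tolist())
```

Here Q1 is 20,000 and Q3 is 40,000, so the bounds are -10,000 and 70,000 and only the 500,000 reading is dropped.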

---

### **12. Perform a bivariate analysis to visualize the relationship between year and selling_price.**  

#### **Python Code:**
```python
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['year'], y=df['selling_price'])
plt.xticks(rotation=45)
plt.xlabel("Year")
plt.ylabel("Selling Price")
plt.title("Year vs. Selling Price")
plt.show()
```

---

### **13. What is the average depreciation in selling price based on the bike's age?**  

\[
\text{Depreciation per year} = \frac{\text{Ex-showroom Price} - \text{Selling Price}}{\text{Age}}
\]

#### **Python Code:**
```python
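# Age is measured against 2024, the reference year assumed here
df['age'] = 2024 - df['year']
# A bike from the reference year has age 0; mask it so the division yields NaN instead of inf
df['depreciation'] = (df['ex_showroom_price'] - df['selling_price']) / df['age'].where(df['age'] > 0)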

avg_depreciation = df['depreciation'].mean()
print(f"Average depreciation per year: {avg_depreciation:.2f} INR")
```

---

### **14. Which bike names are priced significantly above the average price for their manufacturing year?**  

#### **Python Code:**
```python
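# Broadcast each year's mean price back onto its rows, then compare row by row.
# Note: this flags every bike priced above its year's mean, with no margin for "significantly".
year_mean_price = df.groupby('year')['selling_price'].transform('mean')
df['above_avg'] = df['selling_price'] > year_mean_price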

expensive_bikes = df[df['above_avg']]
print(expensive_bikes[['name', 'year', 'selling_price']])
```
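The question asks for bikes priced *significantly* above their year's average, which needs a threshold. The sketch below assumes "significantly" means more than 1.5 times the year's mean; that cutoff and the `demo` data are illustrative assumptions, not something the dataset defines.

```python
import pandas as pd

# Hypothetical sample; only the column names follow the dataset
demo = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "year": [2018, 2018, 2018, 2019, 2019, 2019],
    "selling_price": [40000, 50000, 150000, 60000, 70000, 80000],
})

THRESHOLD = 1.5  # assumed cutoff: 50% above the year's mean price

# transform('mean') broadcasts each year's mean back onto its rows
year_mean = demo.groupby("year")["selling_price"].transform("mean")
significant = demo[demo["selling_price"] > THRESHOLD * year_mean]
print(significant[["name", "year", "selling_price"]])
```

In this sample only bike C qualifies: the 2018 mean is 80,000, so the cutoff is 120,000. A cutoff based on the year's standard deviation would be an equally reasonable choice.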

---

### **15. Develop a correlation matrix for numeric columns and visualize it using a heatmap.**  

#### **Python Code:**
```python
import numpy as np

plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Matrix")
plt.show()
```

---

### **Conclusion for Bike Dataset:**
- **Selling Price Range:** Varies significantly.  
- **Median Price:** Gives a better idea of central tendency.  
- **Seller Type:** Mostly individual sellers.  
- **Older Bikes:** Significant proportion are from 2015 or earlier.  
- **Depreciation Analysis:** Shows how price reduces with age.  
- **Outlier Removal:** IQR method ensures cleaner data.  

---

### **Exploratory Data Analysis (EDA) on Car Sales Dataset**  

I'll go through each question systematically, providing Python code where necessary.  

---

### **1. What is the average selling price of cars for each dealer, and how does it compare across different dealers?**  

#### **Python Code:**
```python
import pandas as pd

# Load the dataset
df = pd.read_csv("car_sales.csv")  # Ensure the filename matches

# Calculate average selling price for each dealer
avg_price_by_dealer = df.groupby('Dealer_Name')['Price ($)'].mean().sort_values(ascending=False)

print(avg_price_by_dealer)
```

We can visualize this with a bar plot:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
avg_price_by_dealer.plot(kind='bar', color='skyblue')
plt.xlabel("Dealer Name")
plt.ylabel("Average Selling Price ($)")
plt.title("Average Selling Price by Dealer")
plt.xticks(rotation=90)
plt.show()
```

---

### **2. Which car brand (Company) has the highest variation in prices, and what does this tell us about pricing trends?**  

#### **Python Code:**
```python
price_variation = df.groupby('Company')['Price ($)'].std().sort_values(ascending=False)
print(price_variation)
```

A high standard deviation means a brand's prices are widely spread, which suggests a lineup ranging from economy to luxury models.
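Standard deviation grows with price level, so an expensive brand can look more variable than a cheap one even when its relative spread is smaller. One hedged way to compare brands on equal footing is the coefficient of variation (standard deviation divided by mean); the `demo` data below is made up.

```python
import pandas as pd

# Hypothetical sample; only the column names follow the dataset
demo = pd.DataFrame({
    "Company": ["X", "X", "X", "Y", "Y", "Y"],
    "Price ($)": [10000, 20000, 30000, 50000, 60000, 70000],
})

stats = demo.groupby("Company")["Price ($)"].agg(["mean", "std"])
# Coefficient of variation: spread relative to the brand's typical price
stats["cv"] = stats["std"] / stats["mean"]
print(stats.sort_values("cv", ascending=False))
```

Both brands have the same standard deviation here, but brand X varies far more relative to its average price.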

---

### **3. What is the distribution of car prices for each transmission type, and how do the interquartile ranges compare?**  

#### **Python Code:**
```python
import seaborn as sns

plt.figure(figsize=(8, 6))
sns.boxplot(x=df['Transmission'], y=df['Price ($)'])
plt.xlabel("Transmission Type")
plt.ylabel("Car Price ($)")
plt.title("Car Price Distribution by Transmission Type")
plt.show()
```
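The box plot shows the interquartile ranges visually; to compare them as numbers, the quartiles can be computed per transmission type. A minimal sketch on made-up data (the `demo` frame is hypothetical):

```python
import pandas as pd

# Hypothetical sample; only the column names follow the dataset
demo = pd.DataFrame({
    "Transmission": ["Auto"] * 5 + ["Manual"] * 5,
    "Price ($)": [20000, 22000, 24000, 26000, 28000,
                  10000, 20000, 30000, 40000, 50000],
})

# One row per transmission type, one column per quantile
q = demo.groupby("Transmission")["Price ($)"].quantile([0.25, 0.75]).unstack()
q["IQR"] = q[0.75] - q[0.25]
print(q)
```

A larger IQR means the middle half of prices is more spread out for that transmission type.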

---

### **4. What is the distribution of car prices across different regions?**  

#### **Python Code:**
```python
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['Dealer_Region'], y=df['Price ($)'])
plt.xticks(rotation=45)
plt.xlabel("Region")
plt.ylabel("Price ($)")
plt.title("Car Price Distribution Across Regions")
plt.show()
```

---

### **5. What is the distribution of cars based on body styles?**  

#### **Python Code:**
```python
body_style_counts = df['Body Style'].value_counts()

plt.figure(figsize=(8, 6))
body_style_counts.plot(kind='bar', color='orange')
plt.xlabel("Body Style")
plt.ylabel("Number of Cars")
plt.title("Distribution of Cars by Body Style")
plt.show()
```

---

### **6. How does the average selling price of cars vary by customer gender and annual income?**  

#### **Python Code:**
```python
avg_price_by_gender = df.groupby('Gender')['Price ($)'].mean()
print(avg_price_by_gender)
```

A visualization:

```python
plt.figure(figsize=(6, 4))
# barplot aggregates with the mean by default, so no explicit estimator (or NumPy import) is needed
sns.barplot(x=df['Gender'], y=df['Price ($)'], palette='coolwarm')
plt.xlabel("Gender")
plt.ylabel("Average Price ($)")
plt.title("Average Car Price by Gender")
plt.show()
```
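The question also asks about annual income, which the gender-only view leaves out. One possible way to look at both together is a pivot table of mean price by income bracket and gender. The bin edges, labels, and `demo` data below are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample; only the column names follow the dataset
demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Annual Income": [40000, 45000, 120000, 130000, 800000, 900000],
    "Price ($)": [15000, 17000, 25000, 27000, 60000, 55000],
})

# Bin edges are assumptions; the open-ended last bin catches very high incomes
demo["Income Bracket"] = pd.cut(
    demo["Annual Income"],
    bins=[0, 100000, 500000, float("inf")],
    labels=["<100K", "100K-500K", "500K+"],
)

# Mean price for every bracket/gender combination
table = demo.pivot_table(
    index="Income Bracket", columns="Gender",
    values="Price ($)", aggfunc="mean", observed=False,
)
print(table)
```

Reading across a row compares genders within one income bracket; reading down a column shows how price changes with income for one gender.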

---

### **7. What is the distribution of car prices by region, and how does the number of cars sold vary by region?**  

#### **Python Code:**
```python
plt.figure(figsize=(12, 6))
sns.boxplot(x=df['Dealer_Region'], y=df['Price ($)'])
plt.xticks(rotation=45)
plt.title("Car Price Distribution by Region")
plt.show()
```

Number of cars sold per region:

```python
sales_by_region = df['Dealer_Region'].value_counts()

plt.figure(figsize=(10, 5))
sales_by_region.plot(kind='bar', color='green')
plt.xlabel("Region")
plt.ylabel("Number of Cars Sold")
plt.title("Number of Cars Sold by Region")
plt.show()
```

---

### **8. How does the average car price differ between cars with different engine sizes?**  

#### **Python Code:**
```python
avg_price_by_engine = df.groupby('Engine')['Price ($)'].mean().sort_values(ascending=False)
print(avg_price_by_engine)
```

Visualization:

```python
plt.figure(figsize=(8, 6))
sns.boxplot(x=df['Engine'], y=df['Price ($)'])
plt.xlabel("Engine Size")
plt.ylabel("Car Price ($)")
plt.title("Car Price Distribution by Engine Size")
plt.xticks(rotation=45)
plt.show()
```

---

### **9. How do car prices vary based on the customer’s annual income bracket?**  

#### **Python Code:**
```python
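# The last edge is open-ended so that incomes above 500,000 still fall into the '200K+' bracket
df['Income Bracket'] = pd.cut(df['Annual Income'], bins=[0, 50000, 100000, 150000, 200000, float('inf')],
                              labels=['<50K', '50K-100K', '100K-150K', '150K-200K', '200K+'])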

plt.figure(figsize=(8, 6))
sns.boxplot(x=df['Income Bracket'], y=df['Price ($)'])
plt.xlabel("Annual Income Bracket")
plt.ylabel("Car Price ($)")
plt.title("Car Price Variation by Annual Income Bracket")
plt.show()
```

---

### **10. What are the top 5 car models with the highest number of sales, and how does their price distribution look?**  

#### **Python Code:**
```python
top_models = df['Model'].value_counts().head(5)
print(top_models)
```

Visualization:

```python
plt.figure(figsize=(8, 6))
# Filter once so that x and y come from the same rows
top_df = df[df['Model'].isin(top_models.index)]
sns.boxplot(x=top_df['Model'], y=top_df['Price ($)'])
plt.xlabel("Car Model")
plt.ylabel("Price ($)")
plt.title("Price Distribution of Top 5 Sold Car Models")
plt.xticks(rotation=45)
plt.show()
```

---

### **11. How does car price vary with engine size across different car colors, and which colors have the highest price variation?**  

#### **Python Code:**
```python
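plt.figure(figsize=(12, 6))
# hue splits each color's box by engine so both dimensions are visible
sns.boxplot(x=df['Color'], y=df['Price ($)'], hue=df['Engine'])
plt.xticks(rotation=45)
plt.title("Car Price Distribution by Color and Engine")
plt.show()

# Standard deviation of price per color, highest variation first
print(df.groupby('Color')['Price ($)'].std().sort_values(ascending=False))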
```

---

### **12. Is there any seasonal trend in car sales based on the date of sale?**  

#### **Python Code:**
```python
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.month

monthly_sales = df.groupby('Month')['Price ($)'].count()

plt.figure(figsize=(8, 5))
monthly_sales.plot(kind='bar', color='purple')
plt.xlabel("Month")
plt.ylabel("Number of Sales")
plt.title("Seasonal Trends in Car Sales")
plt.show()
```

---

### **13. How does the car price distribution change when considering different combinations of body style and transmission type?**  

#### **Python Code:**
```python
plt.figure(figsize=(12, 6))
sns.boxplot(x=df['Body Style'], y=df['Price ($)'], hue=df['Transmission'])
plt.xticks(rotation=45)
plt.title("Car Price Distribution by Body Style and Transmission")
plt.show()
```

---

### **14. What is the correlation between car price, engine size, and annual income of customers, and how do these features interact?**  

#### **Python Code:**
```python
plt.figure(figsize=(8, 6))
sns.heatmap(df[['Price ($)', 'Annual Income']].corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()
```
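The heatmap above covers price and income only. If `Engine` is stored as text categories rather than a numeric size (an assumption about this dataset), it has to be encoded before it can enter a correlation matrix. The sketch below uses one indicator column per category via `pd.get_dummies`; the engine labels and values are made up.

```python
import pandas as pd

# Hypothetical sample; "Type A" / "Type B" are placeholder engine labels
demo = pd.DataFrame({
    "Price ($)": [20000, 35000, 22000, 40000],
    "Annual Income": [50000, 90000, 55000, 120000],
    "Engine": ["Type A", "Type B", "Type A", "Type B"],
})

# One indicator column per engine category, cast to int so corr() treats them as numeric
encoded = pd.get_dummies(demo, columns=["Engine"], dtype=int)
corr = encoded.corr()
print(corr.round(2))
```

With only two categories the indicator columns are perfectly anti-correlated, so passing `drop_first=True` to `pd.get_dummies` keeps the matrix easier to read.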

---

### **15. How does the average car price vary across different car models and engine types?**  

#### **Python Code:**
```python
avg_price_by_model_engine = df.groupby(['Model', 'Engine'])['Price ($)'].mean()
print(avg_price_by_model_engine)
```

---

### **Conclusion for Car Sales Dataset:**
- **Price Variation by Dealer & Brand:** Some brands show high variation due to luxury models.  
- **Transmission Type & Pricing:** Automatic cars tend to be pricier.  
- **Seasonal Sales Trends:** Certain months show more sales.  
- **Income & Pricing:** Higher income correlates with more expensive car purchases.  

---

### **Exploratory Data Analysis (EDA) on Amazon Sales Dataset**  

I'll go through each question systematically, providing Python code where necessary.  

---

### **1. What is the average rating for each product category?**  

#### **Python Code:**
```python
import pandas as pd

# Load the dataset
df = pd.read_csv("amazon_sales.csv")  # Ensure the filename is correct

# Calculate average rating for each category
avg_rating_by_category = df.groupby('category')['rating'].mean().sort_values(ascending=False)

print(avg_rating_by_category)
```

Visualization:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
avg_rating_by_category.plot(kind='bar', color='skyblue')
plt.xlabel("Category")
plt.ylabel("Average Rating")
plt.title("Average Rating by Product Category")
plt.xticks(rotation=90)
plt.show()
```
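The aggregations in this section assume that `rating`, `rating_count`, `discounted_price`, and `actual_price` are numeric. If the CSV stores them as text with currency symbols or thousands separators (an assumption worth checking with `df.dtypes`), a cleaning pass along these lines may be needed first. The sample values and the `to_number` helper are hypothetical.

```python
import pandas as pd

# Hypothetical raw values, as they might look when read as text
demo = pd.DataFrame({
    "discounted_price": ["₹399", "₹1,299"],
    "actual_price": ["₹1,099", "₹2,499"],
    "rating": ["4.2", "|"],
    "rating_count": ["24,269", None],
})

def to_number(col: pd.Series) -> pd.Series:
    """Strip everything except digits and dots, then parse; bad values become NaN."""
    cleaned = col.astype(str).str.replace(r"[^0-9.]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

for name in demo.columns:
    demo[name] = to_number(demo[name])

print(demo.dtypes)
```

Values that cannot be parsed, such as a stray `|` or a missing entry, become `NaN` and are skipped by `mean()` and `corr()`.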

---

### **2. What are the top rating_count products by category?**  

#### **Python Code:**
```python
# Sort by rating_count, then keep the first (highest) row of each category
top_rated_products = (df.sort_values('rating_count', ascending=False)
                        .groupby('category').head(1)[['category', 'product_name', 'rating_count']])
print(top_rated_products)
```

---

### **3. What is the distribution of discounted prices vs. actual prices?**  

#### **Python Code:**
```python
import seaborn as sns

plt.figure(figsize=(8, 6))
sns.histplot(df['discounted_price'], bins=30, color='blue', label='Discounted Price', kde=True)
sns.histplot(df['actual_price'], bins=30, color='red', label='Actual Price', kde=True)
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.title("Distribution of Discounted vs. Actual Prices")
plt.legend()
plt.show()
```

---

### **4. How does the average discount percentage vary across categories?**  

#### **Python Code:**
```python
df['discount_percentage'] = ((df['actual_price'] - df['discounted_price']) / df['actual_price']) * 100
avg_discount_by_category = df.groupby('category')['discount_percentage'].mean().sort_values(ascending=False)

print(avg_discount_by_category)
```

Visualization:

```python
plt.figure(figsize=(10, 6))
avg_discount_by_category.plot(kind='bar', color='orange')
plt.xlabel("Category")
plt.ylabel("Average Discount (%)")
plt.title("Average Discount Percentage by Category")
plt.xticks(rotation=90)
plt.show()
```

---

### **5. What are the most popular product names?**  

#### **Python Code:**
```python
popular_products = df['product_name'].value_counts().head(10)
print(popular_products)
```

---

### **6. What are the most popular product keywords?**  

#### **Python Code:**
```python
from collections import Counter
import itertools

# Extract words from product names (dropna avoids iterating over missing names)
words = list(itertools.chain.from_iterable(df['product_name'].dropna().str.lower().str.split()))
word_counts = Counter(words)

# Top 10 most common keywords
print(word_counts.most_common(10))
```
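Raw word counts tend to be dominated by filler words such as "for" and "with". A hedged refinement is to tokenize on letters and digits and skip a stop list; the stop list and product names below are small illustrative assumptions.

```python
from collections import Counter

import pandas as pd

# Hypothetical product names, including one missing value
names = pd.Series([
    "USB Cable for Fast Charging",
    "Fast Charging Cable with Adapter",
    None,
])

# Small illustrative stop list; a real analysis would use a fuller one
STOP_WORDS = {"for", "with", "and", "the", "of"}

# findall tokenizes each name; explode flattens the lists into one word per row
words = names.dropna().str.lower().str.findall(r"[a-z0-9]+").explode()
keywords = Counter(w for w in words if w not in STOP_WORDS)
print(keywords.most_common(3))
```

The remaining counts reflect product terms ("cable", "charging") rather than grammar.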

---

### **7. What are the most popular product reviews?**  

#### **Python Code:**
```python
popular_reviews = df['review_title'].value_counts().head(10)
print(popular_reviews)
```

---

### **8. What is the correlation between discounted_price and rating?**  

#### **Python Code:**
```python
correlation = df[['discounted_price', 'rating']].corr()
print(correlation)
```

Visualization:

```python
plt.figure(figsize=(6, 4))
sns.heatmap(correlation, annot=True, cmap="coolwarm")
plt.title("Correlation between Discounted Price and Rating")
plt.show()
```

---

### **9. What are the Top 5 categories based on the highest ratings?**  

#### **Python Code:**
```python
top_categories = df.groupby('category')['rating'].mean().sort_values(ascending=False).head(5)
print(top_categories)
```

---

### **10. Identify any potential areas for improvement or optimization based on the data analysis.**  

Key insights:  
- Categories with **low ratings** may need product quality improvements.  
- **High discount percentage** does not always lead to higher ratings.  
- Certain **keywords** appear frequently, suggesting trends in product demand.  

---

### **Conclusion for Amazon Sales Dataset:**
- **Best-rated Categories:** Identified top-rated product categories.  
- **Popular Keywords:** Common words in product names give insights into trending products.  
- **Price vs. Rating Correlation:** No strong correlation between discounted price and rating.  
- **Discount Strategies:** Some categories have high discounts but low ratings, indicating a potential issue.  

---

### **Exploratory Data Analysis (EDA) on Spotify Dataset**  

I'll go through each question systematically, providing Python code where necessary.  

---

### **1. Read the dataframe, check null values, and remove duplicates if present.**  

#### **Python Code:**
```python
import pandas as pd

# Load the dataset
df = pd.read_csv("spotify_data.csv")  # Ensure correct filename

# Check for null values
missing_values = df.isnull().sum()
print("Missing values per column:\n", missing_values)

# Remove duplicate rows
df = df.drop_duplicates()
print(f"Dataset size after removing duplicates: {df.shape}")
```

---

### **2. What is the distribution of popularity among the tracks in the dataset? Visualize it using a histogram.**  

#### **Python Code:**
```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 6))
sns.histplot(df['Popularity'], bins=30, kde=True, color='blue')
plt.xlabel("Popularity Score")
plt.ylabel("Frequency")
plt.title("Distribution of Track Popularity")
plt.show()
```

---

### **3. Is there any relationship between popularity and track duration? (Scatter plot)**  

#### **Python Code:**
```python
plt.figure(figsize=(8, 6))
sns.scatterplot(x=df['Duration (ms)'], y=df['Popularity'], alpha=0.5)
plt.xlabel("Track Duration (ms)")
plt.ylabel("Popularity Score")
plt.title("Popularity vs. Track Duration")
plt.show()
```
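A scatter plot shows the shape of the relationship; a correlation coefficient puts a number on it. The sketch below computes Pearson's r and a rank-based (Spearman) coefficient on made-up values; the `demo` frame is hypothetical.

```python
import pandas as pd

# Hypothetical sample; only the column names follow the dataset
demo = pd.DataFrame({
    "Duration (ms)": [180000, 200000, 220000, 240000, 260000],
    "Popularity": [70, 55, 80, 60, 75],
})

# Pearson measures linear association
pearson = demo["Duration (ms)"].corr(demo["Popularity"])
# Spearman is Pearson applied to ranks, so it captures any monotonic trend
spearman = demo["Duration (ms)"].rank().corr(demo["Popularity"].rank())
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

Values near 0 suggest little relationship, while values near 1 or -1 suggest a strong one.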

---

### **4. Which artist has the highest number of tracks in the dataset? Display the count using a bar plot.**  

#### **Python Code:**
```python
artist_counts = df['Artist'].value_counts().head(10)
print(artist_counts)

plt.figure(figsize=(10, 6))
artist_counts.plot(kind='bar', color='green')
plt.xlabel("Artist")
plt.ylabel("Number of Tracks")
plt.title("Top 10 Artists with the Most Tracks")
plt.xticks(rotation=45)
plt.show()
```

---

### **5. What are the top 5 least popular tracks in the dataset? Provide the artist name and track name.**  

#### **Python Code:**
```python
least_popular_tracks = df.nsmallest(5, 'Popularity')[['Artist', 'Track Name', 'Popularity']]
print(least_popular_tracks)
```

---

### **6. Among the top 5 most popular artists, which one has the highest average popularity?**  

#### **Python Code:**
```python
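# "Most popular artists" is interpreted here as the five artists with the most tracks in the dataset
top_artists = df['Artist'].value_counts().head(5).index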
avg_popularity = df[df['Artist'].isin(top_artists)].groupby('Artist')['Popularity'].mean().sort_values(ascending=False)
print(avg_popularity)
```
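An alternative reading of "most popular artists" ranks artists by their mean popularity score rather than by track count. The sketch below assumes a minimum number of tracks per artist so that a single hit does not dominate; `MIN_TRACKS` and the `demo` data are hypothetical.

```python
import pandas as pd

# Hypothetical sample; only the column names follow the dataset
demo = pd.DataFrame({
    "Artist": ["A", "A", "A", "B", "B", "C"],
    "Popularity": [60, 70, 80, 90, 85, 99],
})

MIN_TRACKS = 2  # assumed floor so one-track artists do not dominate the ranking

stats = demo.groupby("Artist")["Popularity"].agg(["mean", "count"])
ranked = stats[stats["count"] >= MIN_TRACKS].sort_values("mean", ascending=False)
print(ranked.head(5))
```

Artist C has the single highest score but is excluded for having only one track, so B ranks first.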

---

### **7. For the top 5 most popular artists, what are their most popular tracks?**  

#### **Python Code:**
```python
popular_tracks_by_artist = df[df['Artist'].isin(top_artists)].groupby('Artist').apply(lambda x: x.nlargest(1, 'Popularity'))[['Track Name', 'Popularity']]
print(popular_tracks_by_artist)
```

---

### **8. Visualize relationships between multiple numerical variables using a pair plot.**  

#### **Python Code:**
```python
sns.pairplot(df[['Popularity', 'Duration (ms)']])
plt.show()
```

---

### **9. Does the duration of tracks vary significantly across different artists? (Box plot or violin plot)**  

#### **Python Code:**
```python
plt.figure(figsize=(12, 6))
sns.boxplot(x=df['Artist'], y=df['Duration (ms)'])
plt.xticks(rotation=90)
plt.xlabel("Artist")
plt.ylabel("Duration (ms)")
plt.title("Track Duration by Artist")
plt.show()
```

---

### **10. How does the distribution of track popularity vary for different artists? (Swarm plot or violin plot)**  

#### **Python Code:**
```python
plt.figure(figsize=(12, 6))
sns.violinplot(x=df['Artist'], y=df['Popularity'])
plt.xticks(rotation=90)
plt.xlabel("Artist")
plt.ylabel("Popularity")
plt.title("Popularity Distribution by Artist")
plt.show()
```

---

### **Conclusion for Spotify Dataset:**
- **Track Popularity:** Most tracks have mid-range popularity scores.  
- **Duration vs. Popularity:** No strong correlation between track duration and popularity.  
- **Top Artists:** Certain artists dominate the dataset.  
- **Popular Tracks:** The most popular tracks are clustered among a few artists.  

---

