# 📸 Top Instagram Influencers: Data Analysis

| Field | Details |
|-------|--------|
| **Project Title** | Top Instagram Influencers Data Analysis |
| **Tools** | Python, Pandas, NumPy, Matplotlib, Seaborn, Plotly |
| **Domain** | Data Analysis & Social Media Analytics |
| **Difficulty** | Intermediate |

---
## 📌 Objective
Analyze the top Instagram influencers dataset to uncover trends in followers, engagement rates, influence scores, posting habits, total likes, and country-wise distribution.

---

## Step 1: Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 12
sns.set_style('whitegrid')

print('✅ All libraries imported!')

## Step 2: Load & Explore Dataset

In [None]:
# Load dataset
df = pd.read_csv('top_insta_influencers_data.csv')

print(f'Shape: {df.shape}')
print(f'Columns: {list(df.columns)}')
df.head(10)

In [None]:
df.info()

In [None]:
df.describe()

## Step 3: Data Cleaning & Type Conversion

In [None]:
# Check for missing values
print('Missing values per column:')
print(df.isnull().sum())
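
Raw counts are easier to judge next to the share of rows they represent. The sketch below is one way to produce that view; `missing_summary` and the toy frame are hypothetical stand-ins, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the influencer data (all values are made up).
toy = pd.DataFrame({
    'channel_info': ['a', 'b', 'c', 'd'],
    'country': ['India', np.nan, 'Brazil', np.nan],
    'followers': ['1.2m', '3.4m', None, '5.6m'],
})

def missing_summary(frame):
    """Return missing counts and percentages for columns with at least one gap."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'missing_pct': (counts / len(frame) * 100).round(1),
    })
    # Keep only columns that actually have gaps, worst first.
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` on the real dataset would show at a glance whether a column such as `country` is too sparse to chart without a caveat.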

In [None]:
# Convert shorthand notation to numeric (e.g., '475.8m' → 475800000)
def parse_count(val):
    """Convert abbreviated counts like 3.3k, 475.8m, 29.0b to numeric."""
    multipliers = {'k': 1_000, 'm': 1_000_000, 'b': 1_000_000_000}
    if isinstance(val, str):
        val = val.strip().lower().replace(',', '')
        factor = multipliers.get(val[-1:], 1)
        number = val[:-1] if val[-1:] in multipliers else val
        try:
            return float(number) * factor
        except ValueError:
            # Unparseable text (e.g. 'n/a' or an empty string) becomes NaN
            return np.nan
    return float(val) if val is not None else np.nan

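# Convert engagement rate (strip % sign)
def parse_rate(val):
    """Convert a percentage string like '1.39%' to a float (1.39)."""
    if isinstance(val, str):
        try:
            return float(val.replace('%', '').strip())
        except ValueError:
            # Unparseable text becomes NaN instead of stopping the whole conversion
            return np.nan
    return val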

# Apply conversions
df['followers_num']         = df['followers'].apply(parse_count)
df['posts_num']             = df['posts'].apply(parse_count)
df['avg_likes_num']         = df['avg_likes'].apply(parse_count)
df['new_post_avg_like_num'] = df['new_post_avg_like'].apply(parse_count)
df['total_likes_num']       = df['total_likes'].apply(parse_count)
df['eng_rate_num']          = df['60_day_eng_rate'].apply(parse_rate)

print('Conversion complete! Sample:')
df[['channel_info', 'followers', 'followers_num', 'eng_rate_num']].head()
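
The row-wise `.apply` above is fine for a dataset of this size. For comparison, here is a vectorized sketch of the same k/m/b conversion using pandas string methods. `parse_count_series`, `MULTIPLIERS`, and the sample values are hypothetical, and the regex assumes non-negative plain decimals; anything else becomes NaN.

```python
import numpy as np
import pandas as pd

# Suffix multipliers, mirroring the k / m / b handling in the cell above.
MULTIPLIERS = {'k': 1e3, 'm': 1e6, 'b': 1e9}

def parse_count_series(series):
    """Vectorized count parser: '3.3k' -> 3300.0, unparseable input -> NaN."""
    text = (series.astype(str)
                  .str.strip()
                  .str.lower()
                  .str.replace(',', '', regex=False))
    # Split each value into a numeric part and an optional one-letter suffix.
    parts = text.str.extract(r'^(?P<number>\d+(?:\.\d+)?)(?P<suffix>[kmb]?)$')
    number = pd.to_numeric(parts['number'], errors='coerce')
    # Values without a suffix get a factor of 1.
    factor = parts['suffix'].map(MULTIPLIERS).fillna(1.0).astype(float)
    return number * factor

sample = pd.Series(['475.8m', ' 3.3K ', '29.0b', '1,234', 'n/a', None])
parsed = parse_count_series(sample)
print(parsed)
```

Because non-matching rows come back as NaN rather than raising, `parsed.isna().sum()` gives a quick count of values that need a manual look.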

In [None]:
# Remove duplicates
before = len(df)
df.drop_duplicates(inplace=True)
print(f'Duplicates removed: {before - len(df)}')
print(f'Final records: {len(df)}')
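
`drop_duplicates()` with no arguments only removes rows that match in every column. If the same account could appear twice with slightly different figures, a key-based check may be worth running as well. The sketch below assumes `channel_info` identifies an account uniquely; the toy data is made up.

```python
import pandas as pd

# Toy data: 'alpha' appears twice with different follower strings.
toy = pd.DataFrame({
    'channel_info': ['alpha', 'beta', 'alpha', 'gamma'],
    'followers': ['10.0m', '8.5m', '10.1m', '7.2m'],
})

# Rows identical in every column (what drop_duplicates() removes by default).
exact_dupes = toy.duplicated().sum()

# All rows that share an account handle, shown together for inspection.
same_handle = toy[toy.duplicated(subset='channel_info', keep=False)]

# Keep the first row per handle.
deduped = toy.drop_duplicates(subset='channel_info', keep='first').reset_index(drop=True)

print(f'Exact duplicates: {exact_dupes}')
print(same_handle)
print(deduped)
```

Inspecting `same_handle` before dropping anything makes it a deliberate choice which of the conflicting rows survives.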

## Step 4: Exploratory Data Analysis

### 4.1 Top 10 Influencers by Followers

In [None]:
top10_followers = df.nlargest(10, 'followers_num')[['channel_info', 'followers_num', 'followers']]
print(top10_followers.to_string(index=False))

plt.figure(figsize=(14, 7))
colors = sns.color_palette('YlOrRd_r', 10)
bars = plt.barh(top10_followers['channel_info'][::-1],
                top10_followers['followers_num'][::-1] / 1e6,
                color=colors)
plt.xlabel('Followers (Millions)', fontsize=12)
plt.title('Top 10 Instagram Influencers by Followers', fontsize=16, fontweight='bold')
for bar, val in zip(bars, top10_followers['followers'][::-1]):
    plt.text(bar.get_width() + 1, bar.get_y() + bar.get_height()/2,
             val, va='center', fontweight='bold', fontsize=10)
plt.tight_layout()
plt.savefig('insta_top10_followers.png', dpi=150, bbox_inches='tight')
plt.show()
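
The bar labels above reuse the original shorthand strings from the `followers` column. When only the numeric column is available, a small formatter can rebuild labels of that style. `format_count` is a hypothetical helper, not part of the original notebook.

```python
def format_count(value, pos=None):
    """Format a raw count as a short label, e.g. 1_500 -> '1.5K'.

    The unused `pos` argument matches the (value, position) signature that
    matplotlib's FuncFormatter passes to its callback.
    """
    for threshold, suffix in ((1e9, 'B'), (1e6, 'M'), (1e3, 'K')):
        if abs(value) >= threshold:
            return f'{value / threshold:.1f}{suffix}'
    return f'{value:.0f}'

for raw in (42, 1_500, 475_800_000, 29_000_000_000):
    print(raw, format_count(raw))
```

The same function could be passed to `mticker.FuncFormatter` (the `mticker` alias is imported in Step 1) so an axis plotted in raw counts shows ticks like `100.0M` instead of `1e8`.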

### 4.2 Top 10 by Average Likes per Post

In [None]:
top10_likes = df.nlargest(10, 'avg_likes_num')[['channel_info', 'avg_likes_num', 'avg_likes']]

plt.figure(figsize=(14, 6))
bars = plt.bar(top10_likes['channel_info'], top10_likes['avg_likes_num'] / 1e6,
               color=sns.color_palette('Set2', 10))
plt.xticks(rotation=45, ha='right')
plt.xlabel('Influencer')
plt.ylabel('Average Likes (Millions)')
plt.title('Top 10 Influencers by Average Likes per Post', fontsize=16, fontweight='bold')
for bar, val in zip(bars, top10_likes['avg_likes']):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05,
             val, ha='center', va='bottom', fontweight='bold', fontsize=9)
plt.tight_layout()
plt.savefig('insta_avg_likes.png', dpi=150, bbox_inches='tight')
plt.show()

### 4.3 Top 10 by Engagement Rate

In [None]:
top10_eng = df.nlargest(10, 'eng_rate_num')[['channel_info', 'eng_rate_num', '60_day_eng_rate']]
print(top10_eng.to_string(index=False))

plt.figure(figsize=(14, 6))
bars = plt.bar(top10_eng['channel_info'], top10_eng['eng_rate_num'],
               color=sns.color_palette('coolwarm', 10))
plt.xticks(rotation=45, ha='right')
plt.xlabel('Influencer')
plt.ylabel('60-Day Engagement Rate (%)')
plt.title('Top 10 Influencers by Engagement Rate (Last 60 Days)',
          fontsize=16, fontweight='bold')
for bar, val in zip(bars, top10_eng['60_day_eng_rate']):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             val, ha='center', va='bottom', fontweight='bold', fontsize=9)
plt.tight_layout()
plt.savefig('insta_engagement_rate.png', dpi=150, bbox_inches='tight')
plt.show()
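
The notebook takes `60_day_eng_rate` as given and does not state how it is computed. As a rough cross-check, recent average likes can be divided by followers. The sketch below uses that proxy on made-up numbers; the column `likes_per_follower_pct` is hypothetical, and the ratio is an assumption rather than the dataset's definition.

```python
import pandas as pd

# Made-up numbers using the numeric column names created in Step 3.
toy = pd.DataFrame({
    'channel_info': ['alpha', 'beta', 'gamma'],
    'followers_num': [200e6, 50e6, 10e6],
    'new_post_avg_like_num': [2e6, 1e6, 0.5e6],
})

# Proxy: average likes on recent posts as a percentage of followers.
toy['likes_per_follower_pct'] = (
    toy['new_post_avg_like_num'] / toy['followers_num'] * 100
)

ranked = toy.sort_values('likes_per_follower_pct', ascending=False)
print(ranked[['channel_info', 'likes_per_follower_pct']])
```

If the proxy and the reported rate rank accounts very differently on the real data, that is a hint the reported figure includes other interactions, or that the parsing of one of the columns is worth re-checking.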

### 4.4 Top 10 by Total Likes

In [None]:
top10_total = df.nlargest(10, 'total_likes_num')[['channel_info', 'total_likes_num', 'total_likes']]

plt.figure(figsize=(14, 7))
bars = plt.barh(top10_total['channel_info'][::-1],
                top10_total['total_likes_num'][::-1] / 1e9,
                color=sns.color_palette('magma', 10))
plt.xlabel('Total Likes (Billions)')
plt.title('Top 10 Influencers by Total Likes (All-Time)', fontsize=16, fontweight='bold')
for bar, val in zip(bars, top10_total['total_likes'][::-1]):
    plt.text(bar.get_width() + 0.2, bar.get_y() + bar.get_height()/2,
             val, va='center', fontweight='bold', fontsize=10)
plt.tight_layout()
plt.savefig('insta_total_likes.png', dpi=150, bbox_inches='tight')
plt.show()

### 4.5 Influence Score Analysis

In [None]:
print('Influence Score Statistics:')
print(df['influence_score'].describe())

fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Distribution
sns.histplot(df['influence_score'], bins=15, kde=True,
             color='#F5A623', ax=axes[0])
axes[0].set_xlabel('Influence Score')
axes[0].set_title('Distribution of Influence Scores', fontsize=14, fontweight='bold')

# Top 10 by score
top10_score = df.nlargest(10, 'influence_score')[['channel_info', 'influence_score']]
axes[1].barh(top10_score['channel_info'][::-1], top10_score['influence_score'][::-1],
             color=sns.color_palette('YlOrRd', 10))
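# Zoom the x-axis around the plotted scores instead of hard-coding limits
axes[1].set_xlim(top10_score['influence_score'].min() - 5,
                 top10_score['influence_score'].max() + 3)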
axes[1].set_xlabel('Influence Score')
axes[1].set_title('Top 10 by Influence Score', fontsize=14, fontweight='bold')
for i, (score, name) in enumerate(zip(top10_score['influence_score'][::-1],
                                       top10_score['channel_info'][::-1])):
    axes[1].text(score + 0.05, i, str(score), va='center', fontweight='bold')

plt.suptitle('Influence Score Analysis', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.savefig('insta_influence_score.png', dpi=150, bbox_inches='tight')
plt.show()

### 4.6 Country-wise Distribution

In [None]:
country_counts = df['country'].value_counts(dropna=False)
print('Top countries:')
print(country_counts.head(10))

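# Pie chart for the five most common known countries (missing values excluded).
# Note: the percentages are shares of these five, not of the full dataset.
top_countries = country_counts[country_counts.index.notna()].head(5)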
plt.figure(figsize=(10, 7))
plt.pie(top_countries.values, labels=top_countries.index,
        autopct='%1.1f%%', startangle=90,
        colors=sns.color_palette('Set3', len(top_countries)),
        textprops={'fontsize': 12})
plt.title('Country-wise Distribution of Top Instagram Influencers',
          fontsize=15, fontweight='bold')
plt.tight_layout()
plt.savefig('insta_country_distribution.png', dpi=150, bbox_inches='tight')
plt.show()
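
A pie chart built from only the top few countries shows each slice as a share of those countries, not of the whole dataset. One way to keep the percentages honest is to fold everything else into a single bucket. The sketch below does that; `top_n_with_other`, the labels, and the toy data are hypothetical.

```python
import pandas as pd

# Made-up country column with one missing value.
countries = pd.Series(
    ['United States'] * 5 + ['Brazil'] * 3 + ['India'] * 2
    + ['Spain', 'France', None]
)

def top_n_with_other(series, n=2, other_label='Other', missing_label='Unknown'):
    """Count values, keep the n most common, and pool the rest into one bucket."""
    counts = series.fillna(missing_label).value_counts()
    top = counts.head(n)
    rest = counts.iloc[n:].sum()
    if rest > 0:
        top = pd.concat([top, pd.Series({other_label: rest})])
    return top

shares = top_n_with_other(countries, n=2)
print(shares)
print((shares / shares.sum() * 100).round(1))
```

Passing `shares.values` and `shares.index` to `plt.pie` would then give slices that add up to every influencer in the dataset.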

### 4.7 Posts vs Followers Analysis

In [None]:
# Correlation between posts and followers
corr_val = df['posts_num'].corr(df['followers_num'])
print(f'Correlation (Posts vs Followers): {corr_val:.4f}')
print('→ A value near 0 means more posts do NOT imply more followers')

plt.figure(figsize=(12, 6))
scatter = plt.scatter(df['posts_num'] / 1000, df['followers_num'] / 1e6,
                      c=df['influence_score'], cmap='YlOrRd',
                      s=100, alpha=0.7, edgecolors='gray', linewidths=0.5)
plt.colorbar(scatter, label='Influence Score')
plt.xlabel('Posts (Thousands)', fontsize=12)
plt.ylabel('Followers (Millions)', fontsize=12)
plt.title('Posts vs Followers (colored by Influence Score)',
          fontsize=15, fontweight='bold')

# Annotate top 5
top5 = df.nlargest(5, 'followers_num')
for _, row in top5.iterrows():
    plt.annotate(row['channel_info'],
                 (row['posts_num']/1000, row['followers_num']/1e6),
                 textcoords='offset points', xytext=(5, 5), fontsize=9)
plt.tight_layout()
plt.savefig('insta_posts_vs_followers.png', dpi=150, bbox_inches='tight')
plt.show()
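
`Series.corr` defaults to Pearson correlation, which a single account with an unusually high post count can distort. A rank-based (Spearman) coefficient is a useful second opinion. The sketch below computes it as the Pearson correlation of the ranks, on made-up data with one extreme poster.

```python
import pandas as pd

# Made-up data: followers rise steadily, but one account posts far more than the rest.
toy = pd.DataFrame({
    'posts_num': [100, 200, 300, 400, 10_000],
    'followers_num': [50e6, 60e6, 70e6, 80e6, 90e6],
})

# Pearson measures linear association and is sensitive to the outlier.
pearson = toy['posts_num'].corr(toy['followers_num'])

# Spearman is Pearson applied to ranks, so it only cares about ordering.
spearman = toy['posts_num'].rank().corr(toy['followers_num'].rank())

print(f'Pearson:  {pearson:.3f}')
print(f'Spearman: {spearman:.3f}')
```

Reporting both coefficients for the real `posts_num` and `followers_num` columns would show whether a low Pearson value reflects a genuinely weak relationship or just a few extreme accounts.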

### 4.8 Followers vs Engagement Rate

In [None]:
plt.figure(figsize=(12, 6))
plt.scatter(df['followers_num'] / 1e6, df['eng_rate_num'],
            color='#E94560', alpha=0.6, s=80, edgecolors='white')
plt.xlabel('Followers (Millions)', fontsize=12)
plt.ylabel('Engagement Rate (%)', fontsize=12)
plt.title('Followers vs Engagement Rate: Instagram Top Influencers',
          fontsize=15, fontweight='bold')

# Annotate top 5 by engagement
top5_eng = df.nlargest(5, 'eng_rate_num')
for _, row in top5_eng.iterrows():
    plt.annotate(f"{row['channel_info']}\n({row['60_day_eng_rate']})",
                 (row['followers_num']/1e6, row['eng_rate_num']),
                 textcoords='offset points', xytext=(5, 5), fontsize=8,
                 color='#F5A623', fontweight='bold')
plt.tight_layout()
plt.savefig('insta_followers_vs_engagement.png', dpi=150, bbox_inches='tight')
plt.show()

corr2 = df['followers_num'].corr(df['eng_rate_num'])
print(f'Correlation (Followers vs Engagement Rate): {corr2:.4f}')
print('→ A low or negative value means more followers do NOT imply a higher engagement rate')
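
A single correlation coefficient can hide where the relationship changes. Grouping accounts into follower tiers and comparing the median engagement rate per tier is a simple complement. The tier boundaries and data below are made up for illustration.

```python
import pandas as pd

# Made-up data using the numeric column names created in Step 3.
toy = pd.DataFrame({
    'followers_num': [30e6, 45e6, 80e6, 120e6, 250e6, 400e6],
    'eng_rate_num': [3.0, 2.0, 1.5, 1.0, 0.5, 0.7],
})

# Hypothetical tier boundaries; adjust them to the real distribution.
bins = [0, 50e6, 150e6, float('inf')]
labels = ['<50M', '50M-150M', '>150M']
toy['tier'] = pd.cut(toy['followers_num'], bins=bins, labels=labels)

# Median is used because engagement rates tend to be skewed.
tier_summary = (toy.groupby('tier', observed=True)['eng_rate_num']
                   .agg(['count', 'median']))
print(tier_summary)
```

The `count` column matters as much as the median: a tier with only a handful of accounts should not carry much weight in the conclusions.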

## Step 5: Key Insights & Conclusions

---

| # | Insight |
|---|--------|
| 1 | 👑 **Cristiano Ronaldo** leads with 475.8M followers, 30% more than #2 |
| 2 | ❤️ **Kylie Jenner** has the most total likes (57.4B), nearly 2× Cristiano |
| 3 | 📈 **Kendall Jenner** has the highest engagement rate at 2.04% |
| 4 | 🌟 **Selena Gomez** leads influence score with 93/100 |
| 5 | 🇺🇸 **USA dominates** with 65%+ of top influencers |
| 6 | 📸 More posts ≠ more followers: **NatGeo** posts most (10K) but ranks 12th |
| 7 | 📊 Followers and engagement rate have **low/negative correlation** |

---

## Step 6: Next Steps

- **Trend Analysis**: Track influencer growth over time
- **Niche Analysis**: Identify which content categories drive highest engagement
- **Brand Partnership Analysis**: Identify which influencers offer the best marketing ROI
- **ML Model**: Predict engagement rate from follower count and posting frequency
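
As a starting point for the ML idea above, a plain least-squares fit gives a baseline that any fancier model should beat. The sketch below runs on synthetic data only: the coefficients, ranges, and noise level are arbitrary assumptions and say nothing about the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic features: log10 of follower count and log10 of post count.
log_followers = rng.uniform(7.5, 8.7, n)
log_posts = rng.uniform(2.0, 4.0, n)

# Synthetic target with an arbitrary linear relationship plus noise.
eng_rate = 10.0 - 1.0 * log_followers - 0.2 * log_posts + rng.normal(0, 0.1, n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), log_followers, log_posts])

# Simple hold-out split: first 150 rows to fit, the rest to evaluate.
train, test = slice(0, 150), slice(150, n)
coef, *_ = np.linalg.lstsq(X[train], eng_rate[train], rcond=None)

pred = X[test] @ coef
rmse = np.sqrt(np.mean((pred - eng_rate[test]) ** 2))

print('Coefficients (intercept, log_followers, log_posts):', coef.round(3))
print(f'Hold-out RMSE: {rmse:.3f}')
```

On the real data, the features would come from `followers_num` and `posts_num` after a log transform, with `eng_rate_num` as the target and rows containing NaN dropped first.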

---
*Project by: Unified Mentor Internship Program*