# Maharashtra Rajya Sabha Election Prediction Analysis

This notebook provides a comprehensive analysis of Maharashtra Rajya Sabha election data and builds a machine learning model to predict election outcomes.

## Objectives
- Explore and visualize election data from 2014-2024
- Analyze party performance trends
- Build a Logistic Regression model to predict winners
- Predict the likely winner for 2027 elections

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Set style for better visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")

## 2. Load and Explore Data

In [None]:
# Load the data
df = pd.read_csv('../data/maharashtra_election.csv')

print("Dataset Shape:", df.shape)
print("\nFirst 10 records:")
df.head(10)

In [None]:
# Data information
print("Dataset Information:")
df.info()

In [None]:
# Statistical summary
print("Statistical Summary:")
df.describe()

In [None]:
# Check for missing values
print("Missing Values:")
print(df.isnull().sum())

# Unique parties
print("\nUnique Parties:")
print(df['party'].unique())

# Election years
print("\nElection Years:")
print(sorted(df['year'].unique()))
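
Before plotting or modelling, it can help to confirm that the CSV has the columns the later cells rely on. The following is a minimal sketch, not part of the original notebook: `REQUIRED_COLUMNS` and `validate_election_frame` are hypothetical names, the column list is inferred from the cells in this notebook, and the demo frame is made up.

```python
import pandas as pd

# Columns used by the plotting and modelling cells in this notebook (assumed complete).
REQUIRED_COLUMNS = [
    "year", "party", "mla_strength", "alliance_mla_strength",
    "past_rs_wins", "candidate_type", "winner",
]


def validate_election_frame(frame):
    """Return a list of problems found; an empty list means the frame looks usable."""
    missing = [col for col in REQUIRED_COLUMNS if col not in frame.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    if frame[REQUIRED_COLUMNS].isnull().any().any():
        problems.append("null values in required columns")
    if not set(frame["winner"].dropna().unique()) <= {0, 1}:
        problems.append("winner must be coded as 0 or 1")
    return problems


# Made-up rows for illustration only; these are not real election records.
demo = pd.DataFrame({
    "year": [2022, 2024],
    "party": ["Party A", "Party B"],
    "mla_strength": [100, 40],
    "alliance_mla_strength": [150, 70],
    "past_rs_wins": [3, 1],
    "candidate_type": [1, 0],
    "winner": [1, 0],
})

print(validate_election_frame(demo))                          # []
print(validate_election_frame(demo.drop(columns=["party"])))  # ["missing columns: ['party']"]
```

In the notebook itself, the same function could be called on `df` right after `pd.read_csv`.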

## 3. Data Visualization and Analysis

In [None]:
# Party-wise wins over the years
plt.figure(figsize=(14, 6))

# Count wins by party
party_wins = df[df['winner'] == 1].groupby('party').size().sort_values(ascending=False)

plt.subplot(1, 2, 1)
party_wins.plot(kind='bar', color='steelblue')
plt.title('Total Rajya Sabha Seats Won by Each Party (2014-2024)', fontsize=14, fontweight='bold')
plt.xlabel('Party', fontsize=12)
plt.ylabel('Number of Wins', fontsize=12)
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)

# Win percentage by party
plt.subplot(1, 2, 2)
win_percentage = df.groupby('party')['winner'].mean() * 100
win_percentage.sort_values(ascending=False).plot(kind='bar', color='coral')
plt.title('Win Percentage by Party', fontsize=14, fontweight='bold')
plt.xlabel('Party', fontsize=12)
plt.ylabel('Win Percentage (%)', fontsize=12)
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()
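
A win percentage is easier to interpret next to the number of candidates it is based on, since a party with a single candidate shows either 0% or 100%. The sketch below uses a made-up frame with the same `party` and `winner` columns; the `summary` table is illustrative and not part of the original analysis.

```python
import pandas as pd

# Made-up rows for illustration only.
demo = pd.DataFrame({
    "party": ["Party A", "Party A", "Party A", "Party B"],
    "winner": [1, 0, 1, 1],
})

# Count candidates and wins per party alongside the win rate.
summary = demo.groupby("party")["winner"].agg(
    candidates="count",
    wins="sum",
    win_rate="mean",
)
summary["win_pct"] = 100 * summary["win_rate"]

print(summary[["candidates", "wins", "win_pct"]])
```

Here Party B has a 100% win rate from one candidate, which says much less than Party A's two wins out of three.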

In [None]:
# MLA Strength vs Wins Analysis
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
# Scatter plot: MLA strength vs winner
for party in df['party'].unique():
    party_data = df[df['party'] == party]
    plt.scatter(party_data['mla_strength'], party_data['winner'], 
                label=party, alpha=0.6, s=100)

plt.title('MLA Strength vs Election Outcome', fontsize=14, fontweight='bold')
plt.xlabel('MLA Strength', fontsize=12)
plt.ylabel('Winner (1=Yes, 0=No)', fontsize=12)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(alpha=0.3)

plt.subplot(1, 2, 2)
# Box plot: MLA strength for winners vs losers
df.boxplot(column='mla_strength', by='winner', ax=plt.gca())
plt.title('MLA Strength Distribution: Winners vs Losers', fontsize=14, fontweight='bold')
plt.suptitle('')  # Remove default title
plt.xlabel('Winner (0=No, 1=Yes)', fontsize=12)
plt.ylabel('MLA Strength', fontsize=12)

plt.tight_layout()
plt.show()

In [None]:
# Party performance trends over years
plt.figure(figsize=(14, 6))

major_parties = ['BJP', 'INC', 'NCP', 'Shiv Sena']
for party in major_parties:
    party_data = df[df['party'] == party].groupby('year')['mla_strength'].mean()
    plt.plot(party_data.index, party_data.values, marker='o', label=party, linewidth=2)

plt.title('MLA Strength Trends Over Years (Major Parties)', fontsize=14, fontweight='bold')
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average MLA Strength', fontsize=12)
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(10, 8))

# Select numerical columns
numerical_cols = ['year', 'mla_strength', 'alliance_mla_strength', 'past_rs_wins', 'candidate_type', 'winner']
correlation_matrix = df[numerical_cols].corr()

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nKey Insights:")
print(f"- Correlation between MLA Strength and Winning: {correlation_matrix.loc['mla_strength', 'winner']:.3f}")
print(f"- Correlation between Alliance Strength and Winning: {correlation_matrix.loc['alliance_mla_strength', 'winner']:.3f}")
print(f"- Correlation between Past RS Wins and Winning: {correlation_matrix.loc['past_rs_wins', 'winner']:.3f}")

## 4. Machine Learning Model Development

In [None]:
# Prepare features and target
X = df[[
    "year",
    "mla_strength",
    "alliance_mla_strength",
    "past_rs_wins",
    "candidate_type"
]]

y = df["winner"]

print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("\nFeature columns:")
print(X.columns.tolist())

In [None]:
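# Split the data; stratify on the target so both splits keep a similar
# win/lose ratio (requires at least two records in each class)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)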

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
print("\nClass distribution in training set:")
print(y_train.value_counts())
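
A single 80/20 split can give a noisy accuracy estimate when there are few records. One alternative is stratified k-fold cross-validation, sketched below on synthetic data. The arrays `X_demo` and `y_demo` are generated for illustration and stand in for the notebook's `X` and `y`; the choice of five folds is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data: two strength-like features and a noisy 0/1 outcome.
rng = np.random.default_rng(0)
n = 60
mla = rng.integers(10, 130, size=n)
X_demo = np.column_stack([mla, mla + rng.integers(0, 60, size=n)])
y_demo = (mla + rng.normal(0, 20, size=n) > 60).astype(int)

# Each fold keeps roughly the same win/lose ratio as the full data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="accuracy"
)

print(f"Fold accuracies: {np.round(scores, 2)}")
print(f"Mean accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

The spread across folds gives a rough sense of how much the reported accuracy depends on which records land in the test set.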

In [None]:
# Train Logistic Regression model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

print("✓ Model training completed!")
print("\nModel coefficients:")
for feature, coef in zip(X.columns, model.coef_[0]):
    print(f"  {feature}: {coef:.4f}")
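
The features above sit on very different scales (`year` is around 2000, while `past_rs_wins` is likely a small count), so raw coefficients are hard to compare. A possible variant standardizes the features first. This is a sketch on synthetic data, not the notebook's model: `scaled_model` and the demo frame are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: the outcome depends on mla_strength, not on year.
rng = np.random.default_rng(1)
n = 80
demo = pd.DataFrame({
    "year": rng.choice([2014, 2016, 2018, 2020, 2022, 2024], size=n),
    "mla_strength": rng.integers(5, 130, size=n),
})
demo["winner"] = (demo["mla_strength"] + rng.normal(0, 15, size=n) > 50).astype(int)

feature_names = ["year", "mla_strength"]

# Standardize each feature to zero mean and unit variance before fitting.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(demo[feature_names], demo["winner"])

# make_pipeline names each step after its lowercased class name.
coefs = scaled_model.named_steps["logisticregression"].coef_[0]
for name, coef in zip(feature_names, coefs):
    print(f"{name}: {coef:.3f} (change in log-odds per one standard deviation)")
```

With standardized inputs, each coefficient is the change in log-odds per one standard deviation of that feature, which makes magnitudes comparable across features.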

## 5. Model Evaluation

In [None]:
# Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print("="*60)
print("MODEL PERFORMANCE METRICS")
print("="*60)
print(f"\nAccuracy: {accuracy * 100:.2f}%")
print("\nClassification Report:")
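# Pass labels explicitly so the report still works if the test split
# happens to contain only one class
print(classification_report(y_test, y_pred, labels=[0, 1],
                            target_names=['Lose', 'Win'], zero_division=0))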
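
Accuracy is only informative relative to a baseline. If most candidates lose, a model that always predicts "Lose" already scores well. The sketch below computes that majority-class baseline with scikit-learn's `DummyClassifier` on made-up labels (70% losses); the arrays are illustrative and not drawn from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 0 = lose, 1 = win.
y_train_demo = np.array([0] * 14 + [1] * 6)
y_test_demo = np.array([0] * 7 + [1] * 3)

# The baseline ignores the features, so placeholder columns are enough.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# Always predict the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)

baseline_accuracy = accuracy_score(y_test_demo, baseline.predict(X_test_demo))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")  # 0.70
```

A trained model is adding value only to the extent that it beats this number on the same test split.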

In [None]:
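# Confusion Matrix Visualization
# Fix the label order so the matrix is always 2x2, even if one class is
# absent from the test split (the cm[i][j] lookups below rely on this)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])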

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Lose', 'Win'], 
            yticklabels=['Lose', 'Win'],
            cbar_kws={'label': 'Count'})
plt.title('Confusion Matrix', fontsize=14, fontweight='bold')
plt.ylabel('Actual', fontsize=12)
plt.xlabel('Predicted', fontsize=12)
plt.tight_layout()
plt.show()

print(f"\nTrue Negatives: {cm[0][0]}")
print(f"False Positives: {cm[0][1]}")
print(f"False Negatives: {cm[1][0]}")
print(f"True Positives: {cm[1][1]}")
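
The four counts above are enough to derive precision, recall, and F1 for the "Win" class by hand, which can be a useful cross-check against the classification report. The matrix `cm_demo` below is made up for illustration.

```python
import numpy as np

# Made-up confusion matrix: rows are actual (Lose, Win), columns are predicted.
cm_demo = np.array([[8, 2],
                    [1, 4]])

# For a 2x2 matrix with labels [0, 1], ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = cm_demo.ravel()

precision = tp / (tp + fp)  # of predicted wins, how many were real wins
recall = tp / (tp + fn)     # of real wins, how many were predicted
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1 score:  {f1:.2f}")
```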

In [None]:
# Feature importance visualization
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0]
}).sort_values('Coefficient', key=abs, ascending=False)

plt.figure(figsize=(10, 6))
colors = ['green' if x > 0 else 'red' for x in feature_importance['Coefficient']]
plt.barh(feature_importance['Feature'], feature_importance['Coefficient'], color=colors, alpha=0.7)
plt.xlabel('Coefficient Value', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Feature Importance (Logistic Regression Coefficients)', fontsize=14, fontweight='bold')
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nFeature Importance Ranking:")
print(feature_importance)
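
Logistic regression coefficients are on the log-odds scale. Exponentiating them gives odds ratios, which are often easier to read: the multiplicative change in the odds of winning for a one-unit increase in the feature. The coefficient values below are hypothetical and chosen only to show the arithmetic.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients, for illustration only (not fitted values).
demo_coefs = pd.Series({"mla_strength": 0.05, "past_rs_wins": -0.20})

# exp(coefficient) = odds ratio per one-unit increase in the feature.
odds_ratios = np.exp(demo_coefs)

for feature, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{feature}: odds ratio {ratio:.3f} ({direction} the odds of winning)")
```

In the notebook, the same transformation could be applied to `model.coef_[0]`.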

## 6. Predictions for 2027 Elections

In [None]:
# Prepare predictions for each major party
parties_to_predict = ['BJP', 'INC', 'NCP', 'Shiv Sena']
predictions_2027 = []

latest_year = df['year'].max()
print(f"Using {latest_year} data as baseline for 2027 predictions...\n")

for party_name in parties_to_predict:
    # Get the latest record for this party
    party_data = df[df['party'] == party_name].sort_values('year', ascending=False)
    
    if len(party_data) > 0:
        latest_record = party_data.iloc[0]
        
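        # Create prediction data for 2027 as a DataFrame with the training
        # column names, so scikit-learn does not warn about missing feature names
        prediction_data = pd.DataFrame([{
            'year': 2027,
            'mla_strength': latest_record['mla_strength'],
            'alliance_mla_strength': latest_record['alliance_mla_strength'],
            'past_rs_wins': latest_record['past_rs_wins'],
            'candidate_type': latest_record['candidate_type']
        }])[X.columns]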
        
        # Get prediction and probability
        prediction = model.predict(prediction_data)[0]
        probability = model.predict_proba(prediction_data)[0]
        
        predictions_2027.append({
            'Party': party_name,
            'MLA Strength': int(latest_record['mla_strength']),
            'Alliance Strength': int(latest_record['alliance_mla_strength']),
            'Past RS Wins': int(latest_record['past_rs_wins']),
            'Win Probability (%)': round(probability[1] * 100, 2),
            'Prediction': 'WIN' if prediction == 1 else 'LOSE'
        })

# Create DataFrame and sort by win probability
predictions_df = pd.DataFrame(predictions_2027).sort_values('Win Probability (%)', ascending=False)

print("\n" + "="*80)
print("2027 RAJYA SABHA ELECTION PREDICTIONS")
print("="*80)
print(predictions_df.to_string(index=False))

# Determine winner
winner = predictions_df.iloc[0]
print("\n" + "="*80)
print(f"🏆 PREDICTED WINNER: {winner['Party']}")
print(f"   Win Probability: {winner['Win Probability (%)']}%")
print(f"   MLA Strength: {winner['MLA Strength']}")
print("="*80)
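
The win probability reported for each party is the logistic function applied to a weighted sum of its features. The sketch below reproduces `predict_proba` by hand for a binary model with default settings. It uses synthetic data, and the names `clf` and `x_new` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: two strength-like features and a noisy 0/1 outcome.
rng = np.random.default_rng(2)
n = 60
mla = rng.integers(5, 130, size=n).astype(float)
alliance = mla + rng.integers(0, 60, size=n)
X_demo = np.column_stack([mla, alliance])
y_demo = (mla + rng.normal(0, 15, size=n) > 60).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# A hypothetical new observation.
x_new = np.array([[100.0, 150.0]])

# Linear score, then the logistic (sigmoid) function.
z = clf.intercept_[0] + x_new @ clf.coef_[0]
manual_prob = 1.0 / (1.0 + np.exp(-z))

sklearn_prob = clf.predict_proba(x_new)[0, 1]
print(f"Manual:  {manual_prob[0]:.4f}")
print(f"sklearn: {sklearn_prob:.4f}")
```

Because the 2027 rows reuse each party's latest feature values, any difference from that party's latest fitted probability comes only from the `year` term in this weighted sum.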

In [None]:
# Visualize 2027 predictions
plt.figure(figsize=(12, 6))

# Win probability comparison
plt.subplot(1, 2, 1)
colors_map = {'BJP': '#FF9933', 'INC': '#19AAED', 'NCP': '#00B2A9', 'Shiv Sena': '#F37020'}
colors = [colors_map.get(party, 'gray') for party in predictions_df['Party']]
plt.barh(predictions_df['Party'], predictions_df['Win Probability (%)'], color=colors, alpha=0.8)
plt.xlabel('Win Probability (%)', fontsize=12)
plt.ylabel('Party', fontsize=12)
plt.title('2027 Election Win Probability by Party', fontsize=14, fontweight='bold')
plt.grid(axis='x', alpha=0.3)

# MLA strength comparison
plt.subplot(1, 2, 2)
x = np.arange(len(predictions_df))
width = 0.35
plt.bar(x - width/2, predictions_df['MLA Strength'], width, label='MLA Strength', alpha=0.8)
plt.bar(x + width/2, predictions_df['Alliance Strength'], width, label='Alliance Strength', alpha=0.8)
plt.xlabel('Party', fontsize=12)
plt.ylabel('Strength', fontsize=12)
plt.title('MLA and Alliance Strength Comparison', fontsize=14, fontweight='bold')
plt.xticks(x, predictions_df['Party'], rotation=45)
plt.legend()
plt.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Conclusion

### Key Findings:
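1. **Model Performance**: The Logistic Regression model's accuracy, precision, and recall on the held-out 20% test split are reported in Section 5. Because the test split holds only a fifth of the records, treat these figures as rough estimates.
2. **Important Features**: MLA strength and alliance strength are the strongest predictors by coefficient magnitude (Section 5). The features are not standardized, so compare coefficient magnitudes with caution.
3. **2027 Prediction**: The party with the highest predicted win probability in Section 6 is reported as the likely winner. The prediction reuses each party's most recent MLA strength, alliance strength, and past wins, with only the year changed to 2027.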

### Model Limitations:
- Predictions are based on historical patterns and current MLA strength
- Does not account for future political alliances or realignments
- External factors like voter sentiment and national politics are not included

### Future Improvements:
- Include more features, such as voter turnout and economic indicators
- Try ensemble methods (Random Forest, Gradient Boosting)
- Incorporate real-time political alliance data
- Add sentiment analysis from news and social media
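
As a starting point for the ensemble-method idea, the following sketch compares Logistic Regression with Random Forest and Gradient Boosting using 5-fold cross-validation. It runs on synthetic data, so the scores say nothing about the real dataset; the `candidates` dictionary and all hyperparameters are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: two strength-like features and a noisy 0/1 outcome.
rng = np.random.default_rng(3)
n = 100
mla = rng.integers(5, 130, size=n)
X_demo = np.column_stack([mla, mla + rng.integers(0, 60, size=n)])
y_demo = (mla + rng.normal(0, 20, size=n) > 60).astype(int)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# For classifiers, an integer cv uses stratified folds by default.
results = {
    name: cross_val_score(estimator, X_demo, y_demo, cv=5).mean()
    for name, estimator in candidates.items()
}

for name, mean_accuracy in results.items():
    print(f"{name}: mean CV accuracy {mean_accuracy:.2f}")
```

On a small dataset, tree ensembles can overfit, so a simpler model that scores about the same may still be the better choice.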