# Stack Overflow Analytics with PySpark and NLP

**UAS (Final Exam) - Advanced Big Data Predictive Analytics**

This notebook demonstrates a Stack Overflow data analysis pipeline using:
- Apache Spark for big data processing
- NLP for text analysis
- Machine learning for predictive analytics

---

## 1. Setup Environment

Install the required dependencies.

In [1]:
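# Install dependencies
# setuptools supplies a distutils shim; the standard-library module was removed in Python 3.12
!pip install pyspark==3.5.0 setuptools pandas numpy nltk matplotlib seaborn wordcloud plotly kaleido -q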

print("✅ Dependencies installed successfully!")


✅ Dependencies installed successfully!


In [2]:
# Download NLTK data
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('vader_lexicon', quiet=True)

print("NLTK data downloaded!")

NLTK data downloaded!


In [3]:
# Import libraries
import os
import re
import sys
from datetime import datetime

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, StopWordsRemover
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Google Colab setup for the chart generator
try:
    # In Colab this import succeeds; elsewhere it raises ImportError
    import google.colab
    IN_COLAB = True
    print("🔵 Running in Google Colab")
    
    # Set up inline plotting
    %matplotlib inline
    try:
        # Replaces the deprecated IPython.display.set_matplotlib_formats
        from matplotlib_inline.backend_inline import set_matplotlib_formats
        set_matplotlib_formats('retina')
    except Exception:
        pass
    
    plt.rcParams['figure.dpi'] = 100
    plt.rcParams['savefig.dpi'] = 150
    
except ImportError:
    IN_COLAB = False
    print("🔵 Running locally")

# Import Plotly for interactive charts
try:
    import plotly.express as px
    import plotly.graph_objects as go
    PLOTLY_AVAILABLE = True
    print("✅ Plotly available for interactive charts")
except ImportError:
    PLOTLY_AVAILABLE = False
    print("⚠️ Plotly not available, using static charts")

print("✅ Libraries imported successfully!")

ModuleNotFoundError: No module named 'distutils'
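
The `ModuleNotFoundError` above is the error an `import distutils` raises on Python 3.12 and later, where the module is no longer part of the standard library. The traceback does not show which package triggered the import, so the following is only a diagnostic sketch under that assumption; `distutils_status` is a hypothetical helper name, not part of the notebook.

```python
import importlib.util
import sys

def distutils_status():
    """Report the interpreter version and whether `distutils` is importable."""
    found = importlib.util.find_spec("distutils") is not None
    return {"python": sys.version_info[:2], "distutils_available": found}

status = distutils_status()
print(status)
if not status["distutils_available"]:
    # setuptools provides a compatibility shim for the removed module
    print("distutils is missing; try: pip install setuptools, then restart the kernel")
```

If `distutils_available` is `False`, installing `setuptools` and restarting the kernel is a common workaround before re-running the import cell.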

## 2. Initialize Spark Session

In [None]:
# Create Spark Session
spark = SparkSession.builder \
    .appName("StackOverflow-Analytics") \
    .config("spark.driver.memory", "4g") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

print(f"Spark Version: {spark.version}")
print("Spark Session created successfully!")

## 3. Create Sample Dataset

Create sample data that represents Stack Overflow questions.

In [None]:
# Sample Stack Overflow Data
sample_data = [
    (1, "How to read CSV file in Python using pandas?", 
     "I want to read a CSV file and convert it to DataFrame. What is the best approach?",
     "<python><pandas><csv>", 45, 12500, 5, 1, "2024-01-15"),
    (2, "JavaScript async await not working properly",
     "My async function is not waiting for the promise to resolve. Getting undefined.",
     "<javascript><async-await><promise>", 32, 8900, 4, 1, "2024-01-20"),
    (3, "Docker container networking between containers",
     "How can I connect two Docker containers so they can communicate with each other?",
     "<docker><networking><containers>", 28, 6700, 3, 1, "2024-02-05"),
    (4, "Python list comprehension with if else",
     "How do I write a list comprehension with conditional logic in Python?",
     "<python><list-comprehension>", 55, 15000, 6, 1, "2024-02-10"),
    (5, "React useState not updating immediately",
     "State updates are not reflected immediately after calling setState. Why?",
     "<reactjs><hooks><state>", 38, 11200, 5, 1, "2024-02-15"),
    (6, "SQL JOIN multiple tables with conditions",
     "How to join three tables with WHERE clause and GROUP BY in SQL?",
     "<sql><join><mysql>", 42, 9800, 4, 1, "2024-02-20"),
    (7, "Git merge conflict resolution best practices",
     "What is the best way to resolve merge conflicts in Git without losing changes?",
     "<git><merge-conflict><version-control>", 35, 7500, 4, 1, "2024-03-01"),
    (8, "Machine learning model overfitting problem",
     "My neural network is overfitting on training data. How to prevent this?",
     "<machine-learning><overfitting><deep-learning>", 48, 13400, 6, 1, "2024-03-05"),
    (9, "Kubernetes pod keeps crashing with OOMKilled",
     "My Kubernetes pod is being killed due to memory limit. How to debug?",
     "<kubernetes><docker><memory>", 25, 5600, 3, 1, "2024-03-10"),
    (10, "Python pandas merge dataframes on multiple columns",
     "How to merge two dataframes on multiple columns with different names?",
     "<python><pandas><dataframe>", 52, 14200, 5, 1, "2024-03-15"),
    (11, "CSS flexbox center align not working",
     "I am trying to center a div using flexbox but it is not centering properly.",
     "<css><flexbox><html>", 22, 4800, 3, 0, "2024-03-20"),
    (12, "Node.js express middleware order matters",
     "Why does the order of middleware in Express.js affect my application?",
     "<node.js><express><middleware>", 30, 6200, 3, 1, "2024-03-25"),
    (13, "TensorFlow GPU not detected on Windows",
     "TensorFlow is not recognizing my NVIDIA GPU. CUDA is installed properly.",
     "<tensorflow><gpu><cuda>", 18, 4200, 2, 0, "2024-04-01"),
    (14, "Apache Spark DataFrame operations slow",
     "My Spark job is running very slow. How to optimize DataFrame operations?",
     "<apache-spark><pyspark><performance>", 40, 8500, 4, 1, "2024-04-05"),
    (15, "REST API authentication JWT vs OAuth",
     "What is the difference between JWT and OAuth for API authentication?",
     "<rest><authentication><jwt><oauth>", 65, 18000, 7, 1, "2024-04-10"),
    (16, "Why is my Python code so slow?",
     "My Python script takes too long to run. How can I make it faster?",
     "<python><performance><optimization>", -2, 1200, 1, 0, "2024-04-15"),
    (17, "Help needed with homework assignment",
     "Can someone solve this for me? I don't understand the question.",
     "<python>", -5, 800, 0, 0, "2024-04-20"),
    (18, "Best programming language to learn in 2024",
     "Which programming language should I learn as a beginner in 2024?",
     "<programming-languages><career>", 15, 5500, 8, 0, "2024-04-25"),
    (19, "MongoDB aggregation pipeline group by date",
     "How to group documents by date and calculate sum in MongoDB aggregation?",
     "<mongodb><aggregation><database>", 33, 7200, 3, 1, "2024-05-01"),
    (20, "AWS Lambda cold start optimization",
     "How to reduce cold start time for AWS Lambda functions in Python?",
     "<aws-lambda><python><serverless>", 45, 9600, 4, 1, "2024-05-05"),
]

# Create DataFrame
schema = StructType([
    StructField("Id", IntegerType(), True),
    StructField("Title", StringType(), True),
    StructField("Body", StringType(), True),
    StructField("Tags", StringType(), True),
    StructField("Score", IntegerType(), True),
    StructField("ViewCount", IntegerType(), True),
    StructField("AnswerCount", IntegerType(), True),
    StructField("HasAcceptedAnswer", IntegerType(), True),
    StructField("CreationDate", StringType(), True),
])

df = spark.createDataFrame(sample_data, schema)

print(f"Total Questions: {df.count()}")
df.show(5, truncate=50)
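
Each tuple in `sample_data` has to line up with the nine schema fields in order. A small sketch of a pre-flight check that could run before `spark.createDataFrame`; `EXPECTED_FIELDS` and `find_schema_problems` are hypothetical names, and plain Python types stand in for the Spark types:

```python
# Field names mirror the schema above; Python types approximate the Spark types.
EXPECTED_FIELDS = [
    ("Id", int), ("Title", str), ("Body", str), ("Tags", str),
    ("Score", int), ("ViewCount", int), ("AnswerCount", int),
    ("HasAcceptedAnswer", int), ("CreationDate", str),
]

def find_schema_problems(rows):
    """Return a list of readable problems; an empty list means every row fits."""
    problems = []
    for index, row in enumerate(rows):
        if len(row) != len(EXPECTED_FIELDS):
            problems.append(
                f"row {index}: expected {len(EXPECTED_FIELDS)} fields, got {len(row)}"
            )
            continue
        for value, (name, expected_type) in zip(row, EXPECTED_FIELDS):
            # None is accepted because every field is declared nullable
            if value is not None and not isinstance(value, expected_type):
                problems.append(f"row {index}: {name} should be {expected_type.__name__}")
    return problems

good_row = (1, "How to read CSV file in Python using pandas?",
            "I want to read a CSV file and convert it to DataFrame. What is the best approach?",
            "<python><pandas><csv>", 45, 12500, 5, 1, "2024-01-15")
short_row = (2, "Title only", "<python>", 3)
wrong_type_row = good_row[:4] + ("45",) + good_row[5:]

print(find_schema_problems([good_row]))
print(find_schema_problems([good_row, short_row, wrong_type_row]))
```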

## 4. ETL - Data Transformation

In [None]:
# UDF for parsing tags
@F.udf(ArrayType(StringType()))
def parse_tags(tags_str):
    if not tags_str:
        return []
    return re.findall(r'<([^>]+)>', tags_str)

# Transform data
df_transformed = df \
    .withColumn("TagsList", parse_tags(F.col("Tags"))) \
    .withColumn("TagCount", F.size("TagsList")) \
    .withColumn("TitleLength", F.length("Title")) \
    .withColumn("BodyLength", F.length("Body")) \
    .withColumn("TitleWordCount", F.size(F.split("Title", " "))) \
    .withColumn("CreationDate", F.to_date("CreationDate")) \
    .withColumn("Year", F.year("CreationDate")) \
    .withColumn("Month", F.month("CreationDate"))

# Add quality label
df_transformed = df_transformed.withColumn(
    "QualityLabel",
    F.when((F.col("Score") >= 30) & (F.col("HasAcceptedAnswer") == 1), 2)  # High
    .when(F.col("Score") >= 0, 1)  # Medium
    .otherwise(0)  # Low
)

print("Transformed Data:")
df_transformed.select("Title", "Score", "TagCount", "TitleLength", "QualityLabel").show(10, truncate=40)
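
The tag regex and the quality rule can be checked without a Spark session. The sketch below mirrors the `parse_tags` UDF and the `F.when` chain as plain functions; the `_py` names are hypothetical, and non-null scores are assumed (in Spark, a null `Score` falls through to the `otherwise(0)` branch).

```python
import re

def parse_tags_py(tags_str):
    """Same regex as the parse_tags UDF: '<a><b>' becomes ['a', 'b']."""
    if not tags_str:
        return []
    return re.findall(r'<([^>]+)>', tags_str)

def quality_label_py(score, has_accepted_answer):
    """Mirror of the F.when chain: 2 = High, 1 = Medium, 0 = Low."""
    if score >= 30 and has_accepted_answer == 1:
        return 2
    if score >= 0:
        return 1
    return 0

print(parse_tags_py("<python><pandas><csv>"))
print(parse_tags_py(None))
print(quality_label_py(45, 1))   # high score with an accepted answer
print(quality_label_py(65, 0))   # high score without an accepted answer
print(quality_label_py(-2, 0))   # negative score
```

Note that a high score alone is not enough for the High label: without an accepted answer the question stays Medium.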

## 5. Exploratory Data Analysis (EDA)

In [None]:
# Convert to Pandas for visualization
pdf = df_transformed.toPandas()

# Summary Statistics
print("=" * 50)
print("SUMMARY STATISTICS")
print("=" * 50)
print(f"Total Questions: {len(pdf)}")
print(f"Average Score: {pdf['Score'].mean():.2f}")
print(f"Average Views: {pdf['ViewCount'].mean():.0f}")
print(f"Questions with Accepted Answer: {pdf['HasAcceptedAnswer'].sum()}")
print(f"Average Tags per Question: {pdf['TagCount'].mean():.2f}")

In [None]:
# Helper class for charts (compatible with Colab)
class ColabChartGenerator:
    """Chart generator that is compatible with Google Colab."""
    
    def __init__(self, figsize=(12, 6), interactive=None):
        self.figsize = figsize
        self.colors = sns.color_palette("husl", 10)
        self.hex_colors = ['#e6194b', '#3cb44b', '#ffe119', '#4363d8', '#f58231',
                          '#911eb4', '#46f0f0', '#f032e6', '#bcf60c', '#fabebe']
        
        # Auto-detect: use interactive in Colab if Plotly available
        if interactive is None:
            self.interactive = IN_COLAB and PLOTLY_AVAILABLE
        else:
            self.interactive = interactive and PLOTLY_AVAILABLE
    
    def plot_bar(self, df, x_column, y_column, title="Bar Chart", 
                 horizontal=False, top_n=None, color=None):
        """Plot bar chart - interactive in Colab."""
        plot_df = df.copy()
        if top_n:
            plot_df = plot_df.nlargest(top_n, y_column)
        
        if self.interactive:
            if horizontal:
                fig = px.bar(plot_df, x=y_column, y=x_column, orientation='h',
                           title=title, color_discrete_sequence=self.hex_colors)
            else:
                fig = px.bar(plot_df, x=x_column, y=y_column,
                           title=title, color_discrete_sequence=self.hex_colors)
            fig.update_layout(template='plotly_white')
            return fig
        else:
            fig, ax = plt.subplots(figsize=self.figsize)
            if horizontal:
                ax.barh(plot_df[x_column], plot_df[y_column], color=color or 'steelblue')
            else:
                ax.bar(plot_df[x_column], plot_df[y_column], color=color or 'steelblue')
                plt.xticks(rotation=45, ha='right')
            ax.set_title(title)
            plt.tight_layout()
            return fig
    
    def plot_pie(self, df, values_column, labels_column, title="Distribution", colors=None):
        """Plot pie chart."""
        if self.interactive:
            fig = px.pie(df, values=values_column, names=labels_column, title=title)
            fig.update_traces(textposition='inside', textinfo='percent+label')
            fig.update_layout(template='plotly_white')
            return fig
        else:
            fig, ax = plt.subplots(figsize=(8, 8))
            ax.pie(df[values_column], labels=df[labels_column], 
                   autopct='%1.1f%%', colors=colors)
            ax.set_title(title)
            return fig
    
    def plot_scatter(self, df, x_column, y_column, title="Scatter Plot",
                    color_column=None, size_column=None):
        """Plot scatter chart - great for Colab interactivity."""
        if self.interactive:
            fig = px.scatter(df, x=x_column, y=y_column, color=color_column,
                           size=size_column, title=title)
            fig.update_layout(template='plotly_white')
            return fig
        else:
            fig, ax = plt.subplots(figsize=self.figsize)
            ax.scatter(df[x_column], df[y_column], alpha=0.7, c='steelblue')
            ax.set_xlabel(x_column)
            ax.set_ylabel(y_column)
            ax.set_title(title)
            plt.tight_layout()
            return fig
    
    def plot_histogram(self, data, title="Distribution", bins=30, kde=True):
        """Plot histogram."""
        if self.interactive:
            fig = px.histogram(pd.DataFrame({'value': data}), x='value', 
                             nbins=bins, title=title, marginal='box')
            fig.update_layout(template='plotly_white')
            return fig
        else:
            fig, ax = plt.subplots(figsize=self.figsize)
            sns.histplot(data, bins=bins, kde=kde, ax=ax, color='steelblue')
            ax.set_title(title)
            plt.tight_layout()
            return fig
    
    def plot_heatmap(self, data, title="Heatmap", cmap='YlOrRd'):
        """Plot heatmap."""
        if self.interactive:
            fig = px.imshow(data, title=title, text_auto=True)
            fig.update_layout(template='plotly_white')
            return fig
        else:
            fig, ax = plt.subplots(figsize=self.figsize)
            sns.heatmap(data, annot=True, fmt='.1f', cmap=cmap, ax=ax)
            ax.set_title(title)
            plt.tight_layout()
            return fig
    
    def plot_line(self, df, x_column, y_column, title="Line Chart", 
                  group_column=None):
        """Plot line chart."""
        if self.interactive:
            if group_column:
                fig = px.line(df, x=x_column, y=y_column, color=group_column, title=title)
            else:
                fig = px.line(df, x=x_column, y=y_column, title=title)
            fig.update_layout(template='plotly_white')
            return fig
        else:
            fig, ax = plt.subplots(figsize=self.figsize)
            if group_column:
                for group in df[group_column].unique():
                    group_df = df[df[group_column] == group]
                    ax.plot(group_df[x_column], group_df[y_column], label=group)
                ax.legend()
            else:
                ax.plot(df[x_column], df[y_column])
            ax.set_title(title)
            ax.set_xlabel(x_column)
            ax.set_ylabel(y_column)
            plt.tight_layout()
            return fig
    
    def show(self, fig):
        """Display figure."""
        # Matplotlib figures also define .show(), so test the type instead
        if isinstance(fig, plt.Figure):  # Matplotlib
            plt.show()
        else:  # Plotly
            fig.show()

# Initialize chart generator
charts = ColabChartGenerator(interactive=PLOTLY_AVAILABLE)
print(f"✅ Chart Generator initialized (interactive={charts.interactive})")
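
How the constructor settles on interactive or static output can be read off as a small truth table. A sketch that restates the same decision as a pure function; `resolve_interactive` is a hypothetical name used only for illustration:

```python
from itertools import product

def resolve_interactive(interactive, in_colab, plotly_available):
    """Restate the constructor's choice between Plotly and Matplotlib output."""
    if interactive is None:
        # Auto-detect: interactive only in Colab with Plotly installed
        return in_colab and plotly_available
    # Explicit request: honoured only if Plotly is installed
    return interactive and plotly_available

for interactive, in_colab, plotly_ok in product([None, True, False], [True, False], [True, False]):
    result = resolve_interactive(interactive, in_colab, plotly_ok)
    print(f"interactive={interactive!s:5} in_colab={in_colab!s:5} plotly={plotly_ok!s:5} -> {result}")
```

Because the notebook passes `interactive=PLOTLY_AVAILABLE`, charts are interactive wherever Plotly imports, not only in Colab.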

In [None]:
# Interactive Visualizations (works great in Colab!)

# 1. Score Distribution - Interactive Histogram
fig1 = charts.plot_histogram(pdf['Score'], title='📊 Distribution of Question Scores', bins=15)
charts.show(fig1)

# 2. Quality Distribution - Interactive Pie
quality_df = pdf.groupby('QualityLabel').size().reset_index(name='count')
quality_df['Quality'] = quality_df['QualityLabel'].map({0: 'Low', 1: 'Medium', 2: 'High'})
fig2 = charts.plot_pie(quality_df, values_column='count', labels_column='Quality', 
                       title='📈 Question Quality Distribution')
charts.show(fig2)

# 3. Score vs Views - Interactive Scatter (hover to see details!)
fig3 = charts.plot_scatter(pdf, x_column='Score', y_column='ViewCount', 
                          title='🔍 Score vs View Count (Hover for details!)')
charts.show(fig3)

# 4. Questions by Month - Interactive Bar
monthly_df = pdf.groupby('Month').size().reset_index(name='Count')
fig4 = charts.plot_bar(monthly_df, x_column='Month', y_column='Count', 
                       title='📅 Questions by Month')
charts.show(fig4)

In [None]:
# Tag Analysis
from collections import Counter

# Flatten all tags
all_tags = [tag for tags in pdf['TagsList'] for tag in tags]
tag_counts = Counter(all_tags)

print("Top 10 Most Popular Tags:")
print("-" * 30)
for tag, count in tag_counts.most_common(10):
    print(f"{tag}: {count}")

# Tag WordCloud
wordcloud = WordCloud(
    width=800, height=400,
    background_color='white',
    colormap='viridis'
).generate_from_frequencies(tag_counts)

plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Stack Overflow Tags Word Cloud', fontsize=14)
plt.show()
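
The flatten-then-count step is easier to follow on a few rows. A minimal sketch using the tag lists of three of the sample questions (Ids 1, 4, and 10); the `_demo` names are hypothetical so they do not overwrite the notebook's variables:

```python
from collections import Counter

# Tag lists as they would appear in the TagsList column for Ids 1, 4 and 10
tags_lists_demo = [
    ["python", "pandas", "csv"],
    ["python", "list-comprehension"],
    ["python", "pandas", "dataframe"],
]

# Flatten the list of lists into one list of tags, then count occurrences
all_tags_demo = [tag for tags in tags_lists_demo for tag in tags]
tag_counts_demo = Counter(all_tags_demo)

print(tag_counts_demo.most_common(2))
```

A `Counter` is a dictionary of frequencies, which is why it can be passed straight to `generate_from_frequencies`.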

## 6. NLP - Text Preprocessing

In [None]:
# Combine Title and Body for text analysis
df_text = df_transformed.withColumn(
    "CombinedText",
    F.concat_ws(" ", F.col("Title"), F.col("Body"))
)

# Clean text
@F.udf(StringType())
def clean_text(text):
    if not text:
        return ""
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', ' ', text)
    # Remove special characters
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

df_text = df_text.withColumn("CleanText", clean_text(F.col("CombinedText")))

print("Cleaned Text Sample:")
df_text.select("Title", "CleanText").show(3, truncate=60)
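
The cleaning rules have a few side effects worth seeing on concrete input. A sketch that applies the same four steps as the `clean_text` UDF in plain Python; `clean_text_py` is a hypothetical name:

```python
import re

def clean_text_py(text):
    """Same steps as the clean_text UDF, without Spark."""
    if not text:
        return ""
    text = re.sub(r'<[^>]+>', ' ', text)      # remove HTML tags
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)  # replace non-letters with spaces
    text = text.lower()
    return re.sub(r'\s+', ' ', text).strip()  # collapse whitespace

print(clean_text_py("How to read <b>CSV</b> files in Python 3.12?"))
print(clean_text_py("Node.js express middleware"))
print(clean_text_py("if a < b and c > d"))
```

Digits and punctuation are dropped, so version numbers vanish and `Node.js` splits into two tokens. The tag pattern also swallows anything between a `<` and the next `>`, including comparison operators in prose, as the last line shows.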

In [None]:
# Tokenization
tokenizer = Tokenizer(inputCol="CleanText", outputCol="Words")
df_tokenized = tokenizer.transform(df_text)

# Remove Stopwords
from nltk.corpus import stopwords
stop_words = list(stopwords.words('english'))
# Add custom tech stopwords
stop_words.extend(['want', 'using', 'use', 'get', 'would', 'like', 'need', 'trying'])

remover = StopWordsRemover(inputCol="Words", outputCol="FilteredWords", stopWords=stop_words)
df_filtered = remover.transform(df_tokenized)

print("Tokenized and Filtered Words:")
df_filtered.select("Title", "FilteredWords").show(3, truncate=60)
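
What the `Tokenizer` and `StopWordsRemover` stages do to one cleaned string can be approximated in plain Python. This is only a sketch: `DEMO_STOP_WORDS` is a small hypothetical stand-in for the NLTK English list plus the custom additions, and `str.split()` is assumed to behave like the Spark tokenizer on text whose whitespace is already collapsed.

```python
# Tiny stand-in list; the notebook uses NLTK's English stopwords plus extras
DEMO_STOP_WORDS = {"how", "to", "a", "in", "the", "is", "my", "using", "want"}

def tokenize_and_filter(clean_text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, split on whitespace, then drop stopwords."""
    words = clean_text.lower().split()
    return [word for word in words if word not in stop_words]

print(tokenize_and_filter("how to read csv file in python using pandas"))
```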

## 7. NLP - Sentiment Analysis

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Initialize VADER
sia = SentimentIntensityAnalyzer()

# Custom tech lexicon
tech_lexicon = {
    'solved': 2.0, 'works': 1.5, 'working': 1.0, 'fixed': 2.0, 'perfect': 2.5,
    'error': -2.0, 'bug': -1.5, 'broken': -2.0, 'crash': -2.5, 'fail': -2.0,
    'slow': -1.5, 'deprecated': -1.0, 'issue': -1.0, 'problem': -1.0
}
sia.lexicon.update(tech_lexicon)

# UDF for sentiment
@F.udf(StringType())
def get_sentiment(text):
    if not text:
        return "neutral"
    scores = sia.polarity_scores(text)
    compound = scores['compound']
    if compound >= 0.05:
        return "positive"
    elif compound <= -0.05:
        return "negative"
    else:
        return "neutral"

@F.udf(FloatType())
def get_sentiment_score(text):
    if not text:
        return 0.0
    return float(sia.polarity_scores(text)['compound'])

# Apply sentiment analysis
df_sentiment = df_filtered \
    .withColumn("Sentiment", get_sentiment(F.col("CombinedText"))) \
    .withColumn("SentimentScore", get_sentiment_score(F.col("CombinedText")))

print("Sentiment Analysis Results:")
df_sentiment.select("Title", "Sentiment", "SentimentScore").show(10, truncate=45)
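
Both UDFs score the same text, and the label depends only on the compound score and the ±0.05 cutoffs. A sketch of that mapping as a standalone function, which could also be used to derive the label from an already computed `SentimentScore`; `label_from_compound` is a hypothetical name:

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

for score in (0.62, 0.05, 0.0, -0.049, -0.3):
    print(score, label_from_compound(score))
```

The boundaries are inclusive: exactly 0.05 counts as positive and exactly -0.05 as negative.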

In [None]:
# Sentiment Distribution - Interactive Bar Chart
sentiment_counts = df_sentiment.groupBy("Sentiment").count().toPandas()

# Using our Colab-compatible chart generator
fig = charts.plot_bar(sentiment_counts, x_column='Sentiment', y_column='count',
                     title='🎭 Question Sentiment Distribution')
charts.show(fig)

# Sentiment vs Score - Interactive Scatter
pdf_sentiment = df_sentiment.select("Title", "Sentiment", "SentimentScore", "Score").toPandas()

if charts.interactive:
    fig2 = px.scatter(pdf_sentiment, x='SentimentScore', y='Score', 
                      color='Sentiment', hover_data=['Title'],
                      title='💡 Sentiment Score vs Question Score (Hover to see question!)',
                      color_discrete_map={'positive': '#6bcb77', 'neutral': '#4d96ff', 'negative': '#ff6b6b'})
    fig2.update_layout(template='plotly_white')
    fig2.show()
else:
    colors = {'positive': '#6bcb77', 'neutral': '#4d96ff', 'negative': '#ff6b6b'}
    plt.figure(figsize=(10, 6))
    for sentiment in pdf_sentiment['Sentiment'].unique():
        data = pdf_sentiment[pdf_sentiment['Sentiment'] == sentiment]
        plt.scatter(data['SentimentScore'], data['Score'], 
                   label=sentiment, color=colors.get(sentiment, 'gray'), alpha=0.7)
    plt.xlabel('Sentiment Score')
    plt.ylabel('Question Score')
    plt.title('Sentiment Score vs Question Score')
    plt.legend()
    plt.tight_layout()
    plt.show()

## 8. NLP - TF-IDF Feature Extraction

In [None]:
# TF-IDF
hashingTF = HashingTF(inputCol="FilteredWords", outputCol="RawFeatures", numFeatures=1000)
df_tf = hashingTF.transform(df_sentiment)

idf = IDF(inputCol="RawFeatures", outputCol="TFIDFFeatures")
idf_model = idf.fit(df_tf)
df_tfidf = idf_model.transform(df_tf)

print("TF-IDF Features created successfully!")
df_tfidf.select("Title", "TFIDFFeatures").show(3, truncate=50)
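
To make the weighting concrete, here is a worked sketch of TF-IDF on three tiny documents. It uses the smoothed formula documented for Spark's `IDF`, `idf(t) = ln((m + 1) / (df(t) + 1))` with `m` documents, but keeps terms in a dictionary rather than hashing them into 1000 buckets as `HashingTF` does, so hash collisions are not modelled. All names are illustrative.

```python
import math
from collections import Counter

docs = [
    ["python", "pandas", "csv"],
    ["python", "list", "comprehension"],
    ["python", "pandas", "merge"],
]

def idf_weights(documents):
    """Smoothed inverse document frequency: ln((m + 1) / (df + 1))."""
    m = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}

def tfidf(doc, idf):
    """Raw term count times the IDF weight, per term."""
    term_counts = Counter(doc)
    return {term: count * idf[term] for term, count in term_counts.items()}

idf = idf_weights(docs)
print(round(idf["python"], 4))  # in all 3 docs: ln(4/4)
print(round(idf["pandas"], 4))  # in 2 docs: ln(4/3)
print(round(idf["csv"], 4))     # in 1 doc: ln(4/2)
print(tfidf(docs[0], idf))
```

A term that appears in every document gets weight zero, which is why a ubiquitous word such as `python` contributes little to the feature vector.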

## 9. Machine Learning - Quality Prediction

In [None]:
from pyspark.ml.feature import VectorAssembler

# Prepare features
df_ml = df_tfidf.withColumn("label", F.col("QualityLabel").cast("double"))

# Numeric features
assembler = VectorAssembler(
    inputCols=["TitleLength", "BodyLength", "TagCount", "TitleWordCount", "SentimentScore"],
    outputCol="NumericFeatures"
)
df_ml = assembler.transform(df_ml)

# Combine with TF-IDF
final_assembler = VectorAssembler(
    inputCols=["NumericFeatures", "TFIDFFeatures"],
    outputCol="features"
)
df_final = final_assembler.transform(df_ml)

print("Features prepared for ML!")
print(f"Total samples: {df_final.count()}")

In [None]:
# Split data
train_df, test_df = df_final.randomSplit([0.8, 0.2], seed=42)

print(f"Training samples: {train_df.count()}")
print(f"Testing samples: {test_df.count()}")

# Train Random Forest
rf = RandomForestClassifier(
    featuresCol="features",
    labelCol="label",
    numTrees=50,
    maxDepth=10,
    seed=42
)

rf_model = rf.fit(train_df)
print("Random Forest model trained!")
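
With only 20 rows, it helps to look at the class balance before trusting an 80/20 random split. The sketch below recomputes the quality labels from the `(Score, HasAcceptedAnswer)` pairs of the sample data in plain Python; the helper names are hypothetical.

```python
from collections import Counter

# (Score, HasAcceptedAnswer) for sample questions 1..20, copied from sample_data
score_accepted_pairs = [
    (45, 1), (32, 1), (28, 1), (55, 1), (38, 1), (42, 1), (35, 1), (48, 1),
    (25, 1), (52, 1), (22, 0), (30, 1), (18, 0), (40, 1), (65, 1), (-2, 0),
    (-5, 0), (15, 0), (33, 1), (45, 1),
]

def quality_label_py(score, has_accepted_answer):
    """Same rule as the QualityLabel column: 2 = High, 1 = Medium, 0 = Low."""
    if score >= 30 and has_accepted_answer == 1:
        return 2
    if score >= 0:
        return 1
    return 0

label_counts = Counter(quality_label_py(s, a) for s, a in score_accepted_pairs)
print(dict(sorted(label_counts.items())))
```

The Low class has just two examples, so a random 20% test split may contain none of them, and the metrics reported on this sample should be read as a pipeline smoke test rather than a performance estimate.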

In [None]:
# Predictions
predictions = rf_model.transform(test_df)

# Evaluate
evaluator = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="accuracy"
)

accuracy = evaluator.evaluate(predictions)

# F1 Score
evaluator_f1 = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="f1"
)
f1 = evaluator_f1.evaluate(predictions)

print("=" * 50)
print("MODEL EVALUATION RESULTS")
print("=" * 50)
print(f"Accuracy: {accuracy:.4f}")
print(f"F1 Score: {f1:.4f}")
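
For a multiclass problem, the `f1` metric of `MulticlassClassificationEvaluator` is a weighted F-measure: per-class F1 averaged with weights proportional to each class's share of the true labels. A plain-Python sketch of both metrics on a made-up four-row example (the labels and predictions are illustrative, not model output):

```python
from collections import Counter

def accuracy_and_weighted_f1(labels, predictions):
    """Return (accuracy, F1 weighted by true-label frequency)."""
    total = len(labels)
    pairs = list(zip(labels, predictions))
    correct = sum(1 for y, p in pairs if y == p)
    weighted_f1 = 0.0
    for cls, count in Counter(labels).items():
        tp = sum(1 for y, p in pairs if y == cls and p == cls)
        fp = sum(1 for y, p in pairs if y != cls and p == cls)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weighted_f1 += f1 * count / total
    return correct / total, weighted_f1

demo_labels = [2, 2, 1, 0]
demo_predictions = [2, 2, 2, 0]
demo_accuracy, demo_f1 = accuracy_and_weighted_f1(demo_labels, demo_predictions)
print(f"Accuracy: {demo_accuracy:.4f}")
print(f"Weighted F1: {demo_f1:.4f}")
```

Here class 2 has F1 0.8 (weight 0.5), class 1 has F1 0 (weight 0.25), and class 0 has F1 1 (weight 0.25), giving 0.65 against an accuracy of 0.75.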

In [None]:
# Show predictions
print("Sample Predictions:")
predictions.select("Title", "label", "prediction", "Score").show(10, truncate=40)

## 10. Trend Analysis

In [None]:
# Tag Trend Analysis
from pyspark.sql.functions import explode

# Explode tags
df_tags = df_transformed.select(
    "Id", "Year", "Month", "Score", "ViewCount",
    explode("TagsList").alias("Tag")
)

# Tag statistics
tag_stats = df_tags.groupBy("Tag").agg(
    F.count("Id").alias("QuestionCount"),
    F.avg("Score").alias("AvgScore"),
    F.sum("ViewCount").alias("TotalViews")
).orderBy(F.desc("QuestionCount"))

print("Tag Statistics:")
tag_stats.show(15, truncate=False)

# Interactive Tag Visualization
tag_stats_pdf = tag_stats.toPandas()

# Top 10 Tags - Interactive Bar
fig = charts.plot_bar(tag_stats_pdf.head(10), x_column='Tag', y_column='QuestionCount',
                     title='🏷️ Top 10 Most Popular Tags')
charts.show(fig)

# Tag Score vs Views - Interactive Scatter (Bubble Chart)
if charts.interactive:
    fig2 = px.scatter(tag_stats_pdf, x='AvgScore', y='TotalViews', 
                     size='QuestionCount', hover_name='Tag',
                     title='🎯 Tag Performance (Size = Question Count, Hover for Tag!)',
                     color='QuestionCount', color_continuous_scale='viridis')
    fig2.update_layout(template='plotly_white')
    fig2.show()
else:
    plt.figure(figsize=(12, 6))
    plt.scatter(tag_stats_pdf['AvgScore'], tag_stats_pdf['TotalViews'], 
               s=tag_stats_pdf['QuestionCount']*20, alpha=0.6, c='steelblue')
    for i, row in tag_stats_pdf.iterrows():
        plt.annotate(row['Tag'], (row['AvgScore'], row['TotalViews']), fontsize=8)
    plt.xlabel('Average Score')
    plt.ylabel('Total Views')
    plt.title('Tag Performance (Size = Question Count)')
    plt.tight_layout()
    plt.show()
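
The `explode` plus `groupBy` step at the top of this cell turns one row per question into one row per (question, tag) pair and then aggregates per tag. A plain-Python sketch of the same idea on three of the sample questions (Ids 1, 4, and 10); the variable names are illustrative:

```python
from collections import defaultdict

rows = [
    {"Id": 1, "Score": 45, "ViewCount": 12500, "TagsList": ["python", "pandas", "csv"]},
    {"Id": 4, "Score": 55, "ViewCount": 15000, "TagsList": ["python", "list-comprehension"]},
    {"Id": 10, "Score": 52, "ViewCount": 14200, "TagsList": ["python", "pandas", "dataframe"]},
]

# "Explode": one (tag, score, views) record per tag of every question
exploded = [(tag, row["Score"], row["ViewCount"]) for row in rows for tag in row["TagsList"]]

# "Group by tag": accumulate count, score sum and view sum
totals = defaultdict(lambda: {"QuestionCount": 0, "ScoreSum": 0, "TotalViews": 0})
for tag, score, views in exploded:
    totals[tag]["QuestionCount"] += 1
    totals[tag]["ScoreSum"] += score
    totals[tag]["TotalViews"] += views

tag_stats_demo = {
    tag: {
        "QuestionCount": t["QuestionCount"],
        "AvgScore": t["ScoreSum"] / t["QuestionCount"],
        "TotalViews": t["TotalViews"],
    }
    for tag, t in totals.items()
}

for tag, stats in sorted(tag_stats_demo.items(), key=lambda item: -item[1]["QuestionCount"]):
    print(tag, stats)
```

A question with three tags contributes its score and views to three groups, so `TotalViews` summed across tags exceeds the total views of the questions themselves.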

In [None]:
# Monthly Trend - Interactive Dual Axis Chart
monthly_trend = df_transformed.groupBy("Year", "Month").agg(
    F.count("Id").alias("QuestionCount"),
    F.avg("Score").alias("AvgScore")
).orderBy("Year", "Month").toPandas()

monthly_trend['Period'] = monthly_trend['Year'].astype(str) + '-' + monthly_trend['Month'].astype(str).str.zfill(2)

if charts.interactive:
    from plotly.subplots import make_subplots
    
    fig = make_subplots(specs=[[{"secondary_y": True}]])
    
    fig.add_trace(
        go.Bar(x=monthly_trend['Period'], y=monthly_trend['QuestionCount'], 
               name="Question Count", marker_color='steelblue', opacity=0.7),
        secondary_y=False,
    )
    
    fig.add_trace(
        go.Scatter(x=monthly_trend['Period'], y=monthly_trend['AvgScore'], 
                   name="Avg Score", line=dict(color='red', width=3), mode='lines+markers'),
        secondary_y=True,
    )
    
    fig.update_layout(
        title_text="📈 Monthly Question Trend (Interactive!)",
        template='plotly_white',
        hovermode='x unified'
    )
    fig.update_xaxes(title_text="Period")
    fig.update_yaxes(title_text="Question Count", secondary_y=False, color='steelblue')
    fig.update_yaxes(title_text="Average Score", secondary_y=True, color='red')
    
    fig.show()
else:
    fig, ax1 = plt.subplots(figsize=(12, 5))
    
    ax1.bar(monthly_trend['Period'], monthly_trend['QuestionCount'], color='steelblue', alpha=0.7)
    ax1.set_xlabel('Period')
    ax1.set_ylabel('Question Count', color='steelblue')
    ax1.tick_params(axis='x', rotation=45)
    
    ax2 = ax1.twinx()
    ax2.plot(monthly_trend['Period'], monthly_trend['AvgScore'], color='red', marker='o', linewidth=2)
    ax2.set_ylabel('Average Score', color='red')
    
    plt.title('Monthly Question Trend', fontsize=14)
    plt.tight_layout()
    plt.show()

## 11. Summary & Conclusions

In [None]:
print("=" * 60)
print("STACK OVERFLOW ANALYTICS - SUMMARY")
print("=" * 60)
print()
print("1. DATA OVERVIEW")
print(f"   - Total Questions Analyzed: {df.count()}")
print(f"   - Date Range: {pdf['CreationDate'].min()} to {pdf['CreationDate'].max()}")
print(f"   - Unique Tags: {len(tag_counts)}")
print()
print("2. TOP TECHNOLOGIES")
for i, (tag, count) in enumerate(tag_counts.most_common(5), 1):
    print(f"   {i}. {tag}: {count} questions")
print()
print("3. QUALITY METRICS")
print(f"   - High Quality Questions: {(pdf['QualityLabel'] == 2).sum()}")
print(f"   - Medium Quality Questions: {(pdf['QualityLabel'] == 1).sum()}")
print(f"   - Low Quality Questions: {(pdf['QualityLabel'] == 0).sum()}")
print()
print("4. ML MODEL PERFORMANCE")
print(f"   - Algorithm: Random Forest")
print(f"   - Accuracy: {accuracy:.2%}")
print(f"   - F1 Score: {f1:.2%}")
print()
print("5. KEY INSIGHTS")
print(f"   - '{tag_counts.most_common(1)[0][0]}' is the most frequent tag in this sample")
print("   - Questions with more tags may get higher engagement (not tested on this 20-row sample)")
print("   - Sentiment scores may help flag problematic questions")
print()
print("=" * 60)

## 📱 Google Colab Tips

A few tips for using this notebook in Google Colab:

### Interactive Charts
- Charts drawn with Plotly are **interactive** when the notebook runs in Colab; the tag word cloud is a static Matplotlib image
- Hover over a data point to see its details
- Zoom in/out by scrolling
- Pan by dragging
- Double-click to reset the view

### Saving to Google Drive
Run the cell below, with the `save_results_to_drive()` call uncommented, to save the analysis results to Google Drive.

In [None]:
# Save results to Google Drive (Colab only)
def save_results_to_drive():
    """Save all analysis results to Google Drive."""
    if not IN_COLAB:
        print("❌ This function is only available in Google Colab")
        return
    
    from google.colab import drive
    import os
    
    # Mount Drive
    if not os.path.exists('/content/drive'):
        drive.mount('/content/drive')
    
    # Create the output folder
    save_dir = '/content/drive/MyDrive/stackoverflow_analytics'
    os.makedirs(save_dir, exist_ok=True)
    
    # Save DataFrames
    pdf.to_csv(f'{save_dir}/questions_analyzed.csv', index=False)
    tag_stats_pdf.to_csv(f'{save_dir}/tag_statistics.csv', index=False)
    monthly_trend.to_csv(f'{save_dir}/monthly_trend.csv', index=False)
    
    # Save summary
    summary = f"""
Stack Overflow Analytics Summary
================================
Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

Total Questions: {len(pdf)}
Unique Tags: {len(tag_stats_pdf)}
Date Range: {pdf['CreationDate'].min()} to {pdf['CreationDate'].max()}

Model Performance:
- Accuracy: {accuracy:.4f}
- F1 Score: {f1:.4f}

Top 5 Tags:
{tag_stats_pdf.head(5)[['Tag', 'QuestionCount']].to_string()}
"""
    
    with open(f'{save_dir}/summary.txt', 'w') as f:
        f.write(summary)
    
    print(f"✅ Results saved to: {save_dir}")
    print("   - questions_analyzed.csv")
    print("   - tag_statistics.csv")
    print("   - monthly_trend.csv")
    print("   - summary.txt")

# Uncomment to save:
# save_results_to_drive()

In [None]:
# Stop Spark Session
spark.stop()
print("Spark session stopped. Analysis complete!")