# 🚀 Analysis of Syntactic N-Grams

## Notes on running the notebook
This **notebook** can be worked through at different levels (see the section ["Technical requirements"](../introduction/introduction_requirements)):
1. Book-Only Mode
2. Cloud Mode: click 🚀 and run the notebook in, for example, Colab.
3. Local Mode: click Download ↓ and choose ".ipynb".

## Overview
In the following, the corpus is analysed using **syntactic n-grams**. The aim is to examine recurring **grammatical patterns** (in particular adjectival modification of nouns) over time and thereby go beyond purely lexical frequencies. In contrast to the analysis of semantic fields, the focus here is on **structured, syntactically motivated word combinations** that are based on dependency relations.

Specifically, we examine how often certain syntactic constructions (e.g. *adjective → noun* relations) occur in the corpus and how their frequency develops over time. This makes stylistic and discursive developments visible, such as changes in how particular concepts are described over longer periods.

The analysis proceeds in the following steps:

1. Reading the corpus, the metadata and the previously created spaCy annotations
2. Selecting the dependency relations that are relevant for building syntactic n-grams
3. Extracting syntactic n-grams from the available dependency information
4. Aggregating the syntactic patterns by time (e.g. years or time intervals)
5. Analysing and visualising the development of selected syntactic n-grams over time
6. Discussing the results with regard to stylistic and discursive changes in the corpus
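
Steps 2 to 4 can be sketched without any NLP library, assuming the dependency arcs have already been extracted. The `Arc` tuple, the function `adjective_noun_counts` and the toy data are hypothetical and only illustrate the idea; the dependency label `nk` is an assumption and is not used for filtering unless `deps` is passed.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Arc(NamedTuple):
    """One dependency arc; a hypothetical stand-in for spaCy token attributes."""
    year: int
    head_lemma: str   # lemma of the governing noun
    dep: str          # dependency label of the dependent token
    child_lemma: str  # lemma of the dependent token
    child_pos: str    # coarse part of speech of the dependent token


# Toy arcs, invented for illustration only
arcs = [
    Arc(1820, "Luft", "nk", "frisch", "ADJ"),
    Arc(1820, "Luft", "nk", "der", "DET"),
    Arc(1850, "Luft", "nk", "frisch", "ADJ"),
    Arc(1850, "Luft", "nk", "kalt", "ADJ"),
    Arc(1850, "Haus", "nk", "alt", "ADJ"),
]


def adjective_noun_counts(arcs, noun, deps=None):
    """Count (adjective, noun) pairs per year for one target noun lemma.

    If `deps` is given, only arcs with one of these labels are counted.
    """
    per_year = defaultdict(Counter)
    for arc in arcs:
        if arc.head_lemma != noun or arc.child_pos != "ADJ":
            continue
        if deps is not None and arc.dep not in deps:
            continue
        per_year[arc.year][(arc.child_lemma, arc.head_lemma)] += 1
    return dict(per_year)


counts = adjective_noun_counts(arcs, "Luft")
for year, counter in sorted(counts.items()):
    print(year, counter.most_common())
```

Each key of a yearly counter is one syntactic bigram (adjective, noun), so the per-year counters are already the time-aggregated form described in step 4.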

## Importing the libraries

In [None]:
from pathlib import Path
from typing import Dict, List, Union, Tuple
from collections import OrderedDict, Counter
from time import time
import pandas as pd
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"
import plotly.graph_objects as go
from itables import show
from tqdm import tqdm
import spacy
from spacy.tokens import DocBin, Doc

Loading the spaCy model

In [None]:
! python -m spacy download de_core_news_sm
nlp = spacy.load("de_core_news_sm")

## Loading the annotations

In this step, the previously created spaCy annotations are read from the file system. The stored `DocBin` files are reconstructed into complete spaCy `Doc` objects and kept in a data structure that allows further analysis.
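
A minimal sketch of this round trip, assuming spaCy v3 is installed. It uses a blank German pipeline (tokenizer only) and invented sentences, so no trained model is needed; the file name `example.spacy` is hypothetical.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import Doc, DocBin

# A blank German pipeline only tokenizes; it stands in for a trained model here
nlp_blank = spacy.blank("de")

# Two toy chunks of one text (invented sentences)
chunks = [nlp_blank("Die Luft war kalt."), nlp_blank("Der Wind wehte.")]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.spacy"  # hypothetical file name
    DocBin(docs=chunks).to_disk(path)   # serialize the chunks

    # Deserialize and rebuild the Doc objects against a vocabulary
    restored = list(DocBin().from_disk(path).get_docs(nlp_blank.vocab))

# Concatenate the chunks into a single Doc
full_doc = Doc.from_docs(restored)
print(len(restored), [token.text for token in full_doc])
```

`get_docs` needs a vocabulary because a `DocBin` stores token attributes compactly and relies on the vocabulary to rebuild full `Doc` objects.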

In [None]:
# When using a different corpus, adjust the directory name here
annotation_dir = Path("../data/spacy")

if not annotation_dir.exists():
    print("The directory does not exist, please check the path again.")

In [None]:
# Create a dictionary for the annotated corpus (key: file name without suffix, value: spaCy Doc)
annotated_docs = {}

start = time()
# Iterate over spacy files
for fp in tqdm(annotation_dir.iterdir(), desc="Reading annotated data"):
    # check if the entry is a file, not a directory
    if fp.is_file():
        # check if the file has the correct suffix spacy
        if fp.suffix == '.spacy':
            print( f"Loading file: {fp.name}" )
            # load spacy DocBin objects
            doc_bin = DocBin().from_disk(fp)
            chunk_docs = list(doc_bin.get_docs(nlp.vocab))
            # merge bins into one single document
            full_doc = Doc.from_docs(chunk_docs)

            # store the merged Doc in the dictionary, key=filename (without suffix), value=spacy.Doc
            annotated_docs[fp.stem] = full_doc
took = time() - start
print(f"Loading the data took: {round(took, 4)} seconds")

In [None]:
print(f"Annotations of the first 20 tokens of the text {list(annotated_docs.keys())[0]}:\n")
print("Token\tLemma\tPoS")
for token in annotated_docs[list(annotated_docs.keys())[0]][:20]:
    print(f"{token.text}\t{token.lemma_}\t{token.pos_}")

## Reading the metadata

The associated metadata are then loaded. The commented-out filter line can be used to restrict them to the texts for which annotations exist. The year values are converted to a datetime format to simplify later time-based aggregation and visualisation.
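
The two operations can be sketched on a toy table, assuming pandas is available. The IDs, titles and the set `annotated_ids` are invented stand-ins for the real metadata and for `annotated_docs.keys()`.

```python
import pandas as pd

# Toy metadata with hypothetical IDs and titles
meta = pd.DataFrame({
    "ID": ["text_a", "text_b", "text_c"],
    "title": ["A", "B", "C"],
    "year": [1820, 1855, 1899],
})
annotated_ids = {"text_a", "text_c"}  # stands in for annotated_docs.keys()

# Keep only the rows whose ID has an annotation
meta = meta[meta["ID"].isin(annotated_ids)].copy()

# Parse the four-digit year into a datetime (January 1 of that year)
meta["year"] = pd.to_datetime(meta["year"], format="%Y")

# The plain integer year remains available through the .dt accessor
print(meta["year"].dt.year.tolist())
```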

In [None]:
metadata_df = pd.read_csv("../metadata/metadata_corpus-german_language_fiction_1820-1900_50-per-decade.csv")
# metadata_df = metadata_df[metadata_df['ID'].isin(annotated_docs.keys())]
# Convert the year column to a datetime type for easier processing
metadata_df['year'] = pd.to_datetime(metadata_df['year'], format="%Y")

In [None]:
show(metadata_df)

In [None]:
metadata_df_alt = pd.read_csv("../metadata/metadata_corpus-german_language_fiction_1820-1900_50-per-decade_ALT.csv")
# metadata_df_alt = metadata_df_alt[metadata_df_alt['ID'].isin(annotated_docs.keys())]
# Convert the year column to a datetime type for easier processing
metadata_df_alt['year'] = pd.to_datetime(metadata_df_alt['year'], format="%Y")

In [None]:
show(metadata_df_alt)

## Extracting syntactic n-grams

In a further step, we extract the adjectives that are syntactically linked to the noun *Luft* ("air"). To do so, we use the dependency structures, which are easy to navigate through spaCy's `Doc` object.
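
How this navigation works can be sketched on a hand-annotated toy sentence, assuming spaCy v3 is installed. The parse below is written by hand, not produced by a model, and the dependency labels are illustrative assumptions.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("de").vocab

# Hand-annotated toy sentence; `heads` holds the absolute index of each token's head.
# The dependency labels are illustrative assumptions, not model output.
doc = Doc(
    vocab,
    words=["Die", "frische", "Luft", "war", "kalt", "."],
    lemmas=["der", "frisch", "Luft", "sein", "kalt", "."],
    pos=["DET", "ADJ", "NOUN", "AUX", "ADJ", "PUNCT"],
    heads=[2, 2, 3, 3, 3, 3],
    deps=["nk", "nk", "sb", "ROOT", "pd", "punct"],
)

# Collect the lemmas of all adjectives that are direct children of the target noun
found = []
for token in doc:
    if token.pos_ == "NOUN" and token.lemma_ == "Luft":
        found.extend(child.lemma_ for child in token.children if child.pos_ == "ADJ")
print(found)
```

In this toy parse only *frisch* is collected: the predicative *kalt* attaches to the verb rather than to *Luft*, so iterating over `token.children` captures attributive adjectives only.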

In [None]:
def extract_dependent_adjective_list(spacy_docs: Dict, metadata_df: pd.DataFrame,
                                     noun_input: Union[str, List[str]], top_n: int = 10) -> Tuple[pd.DataFrame, List[str]]:
    """
    Extract adjectives that depend on a noun or list of nouns and track their frequency over time.
    The stricter `amod` check is commented out below, so any child tagged ADJ is counted.

    Parameters:
    -----------
    spacy_docs : dict
        Dictionary with file_ids as keys and spaCy Doc objects as values
    metadata_df : pd.DataFrame
        DataFrame with columns: 'lastname', 'firstname', 'title', 'year', 'volume', 'ID', 'decade'
    noun_input : str or list of str
        Single noun lemma (e.g., 'liebe') or list of noun lemmata (e.g., ['liebe', 'leidenschaft'])
    top_n : int
        Number of most frequent adjectives to extract (default: 10)

    Returns:
    --------
    tuple : (pd.DataFrame, list)
        - DataFrame with columns: filename, title, year, adjective, count, noun_count
        - List of the top N adjectives found
    """
    # Convert single noun to list for uniform processing
    if isinstance(noun_input, str):
        noun_list = [noun_input]
    else:
        noun_list = noun_input

    # Convert to lowercase for case-insensitive matching
    noun_list_lower = [noun.lower() for noun in noun_list]

    # First pass: count all adjectives modifying any noun in the list across entire corpus
    all_adjectives = Counter()

    for file_id, doc in spacy_docs.items():
        for token in doc:
            # Check if this token is one of our target nouns
            if token.lemma_.lower() in noun_list_lower and token.pos_ == 'NOUN':
                # Find any dependent adjective
                for child in token.children:
                    #if child.dep_ == 'amod' and child.pos_ == 'ADJ':
                    if child.pos_ == 'ADJ':
                        all_adjectives[child.lemma_.lower()] += 1

    # Get top N most frequent adjectives
    top_adjectives = [adj for adj, count in all_adjectives.most_common(top_n)]

    # Second pass: calculate frequencies per document
    results = []

    for file_id, doc in spacy_docs.items():
        # Get metadata for this file
        meta_row = metadata_df[metadata_df['ID'] == file_id]

        if meta_row.empty:
            continue

        # Count adjectives modifying the target nouns
        adjective_counts = Counter()
        noun_count = 0

        for token in doc:
            # Check if this token is one of our target nouns
            if token.lemma_.lower() in noun_list_lower and token.pos_ == 'NOUN':
                noun_count += 1
                #print('found noun:', token.text)
                # Find dependent adjectives (the amod check below is disabled, so any ADJ child counts)
                for child in token.children:
                    #print('child dep:', child.dep_, 'pos:', child.pos_)
                    #if child.dep_ == 'amod' and child.pos_ == 'ADJ':
                    if child.pos_ == 'ADJ':
                        #print('  found dependent adjective:', child.text)
                        adjective_counts[child.lemma_.lower()] += 1

        # Create a row for each top adjective found in this document
        # (even if count is 0, we want to track that)
        for adjective in top_adjectives:
            count = adjective_counts.get(adjective, 0)
            if noun_count > 0 or count > 0:  # Include if we have nouns or this adjective
                results.append({
                    'filename': file_id,
                    'title': meta_row['title'].values[0],
                    'year': meta_row['year'].values[0],
                    'adjective': adjective,
                    'count': count,
                    'noun_count': noun_count
                })

    return pd.DataFrame(results), top_adjectives


def get_top_adjectives_list(adj_df: pd.DataFrame, top_n: int = 10) -> list:
    """
    Get the top N most frequent adjectives from the adjective dataframe.

    Parameters:
    -----------
    adj_df : pd.DataFrame
        DataFrame returned by extract_dependent_adjective_list()
    top_n : int
        Number of most frequent adjectives to return

    Returns:
    --------
    list
        List of top N adjectives
    """
    total_counts = adj_df.groupby('adjective')['count'].sum().sort_values(ascending=False)
    return total_counts.head(top_n).index.tolist()


def plot_adjective_trends_list(adj_df: pd.DataFrame, top_adjectives: list,
                               noun_input: Union[str, List[str]]):
    """
    Create a plot showing adjective modifier trends over time for a noun or noun list.

    Parameters:
    -----------
    adj_df : pd.DataFrame
        DataFrame returned by extract_dependent_adjective_list()
    top_adjectives : list
        List of adjectives to plot (e.g., from get_top_adjectives_list())
    noun_input : str or list of str
        The noun(s) being analyzed

    Returns:
    --------
    plotly.graph_objects.Figure
        The figure object (will display automatically in Jupyter)
    """
    # Filter for only the top adjectives
    filtered_df = adj_df[adj_df['adjective'].isin(top_adjectives)].copy()

    # Calculate relative frequency (per 100 noun occurrences)
    # Handle division by zero
    filtered_df['rel_freq'] = filtered_df.apply(
        lambda row: (row['count'] / row['noun_count']) * 100 if row['noun_count'] > 0 else 0,
        axis=1
    )

    # Create figure
    fig = go.Figure()


    # Show individual texts as scatter points
    for adj in top_adjectives:
        adj_data = filtered_df[filtered_df['adjective'] == adj]

        fig.add_trace(go.Scatter(
            x=adj_data['year'],
            y=adj_data['rel_freq'],
            mode='markers',
            name=adj,
            text=adj_data['title'],
            hovertemplate='<b>%{fullData.name}</b><br>' +
                         '<b>%{text}</b><br>' +
                         'Year: %{x}<br>' +
                         'Frequency: %{y:.2f} per 100 occurrences<br>' +
                         '<extra></extra>',
            marker=dict(size=8, opacity=0.7)
        ))

    # Create title based on input
    if isinstance(noun_input, str):
        title_noun = f'"{noun_input}"'
    else:
        noun_str = ', '.join(noun_input[:3])
        if len(noun_input) > 3:
            noun_str += f', ... ({len(noun_input)} total)'
        title_noun = f'[{noun_str}]'

    # Update layout
    fig.update_layout(
        title=f'Adjective Syntactic Dependents of {title_noun} Over Time',
        xaxis_title='Year',
        yaxis_title='Relative Frequency (per 100 noun occurrences)',
        hovermode='closest',
        height=600,
        legend=dict(
            title='Adjectives',
            yanchor="top",
            y=0.99,
            xanchor="right",
            x=0.99
        )
    )

    return fig


def plot_adjective_trends_moving_avg_plotly(
    adj_df: pd.DataFrame,
    top_adjectives: list,
    noun_input,
    window_years: int = 10,
    n_plot: int = 8,
    value_col: str = "rel_freq",
    show_points: bool = True,
):
    """
    Plotly lineplot per year for adjective dependents + centered moving average window.

    Expects adj_df to have at least: year, adjective, count, noun_count
    If value_col (default: rel_freq) is missing, it will be computed as (count / noun_count) * 100.
    """

    df = adj_df.copy()

    # --- Ensure year is integer year (avoid datetime nanoseconds weirdness) ---
    if "year" not in df.columns:
        raise ValueError("adj_df must contain a 'year' column.")

    if pd.api.types.is_datetime64_any_dtype(df["year"]):
        df["year"] = df["year"].dt.year
    else:
        df["year"] = pd.to_numeric(df["year"], errors="coerce")

    df = df.dropna(subset=["year"])
    df["year"] = df["year"].astype(int)

    # --- Compute relative frequency if needed ---
    if value_col not in df.columns:
        if not {"count", "noun_count"}.issubset(df.columns):
            raise ValueError(
                f"adj_df must contain '{value_col}' or both 'count' and 'noun_count'."
            )
        df[value_col] = df.apply(
            lambda row: (row["count"] / row["noun_count"]) * 100 if row["noun_count"] else 0,
            axis=1,
        )

    # --- Filter adjectives ---
    df = df[df["adjective"].isin(top_adjectives)].copy()
    if df.empty:
        raise ValueError("After filtering by top_adjectives, no rows remain.")

    # --- Yearly aggregate (mean across texts) ---
    yearly = (
        df.groupby(["year", "adjective"])[value_col]
          .mean()
          .unstack("adjective")
          .sort_index()
    )

    # --- Moving average on yearly aggregates (window counts rows, i.e. years that have data) ---
    moving = yearly.rolling(window=window_years, center=True, min_periods=1).mean()

    # --- Limit to n_plot adjectives (keep original top_adjectives order if possible) ---
    cols_in_data = [a for a in top_adjectives if a in moving.columns]
    cols_to_plot = cols_in_data[:n_plot] if cols_in_data else list(moving.columns)[:n_plot]
    moving_plot = moving[cols_to_plot].copy()

    # --- Build title ---
    if isinstance(noun_input, str):
        title_noun = f'"{noun_input}"'
    else:
        noun_str = ", ".join(noun_input[:3])
        if len(noun_input) > 3:
            noun_str += f", ... ({len(noun_input)} total)"
        title_noun = f"[{noun_str}]"

    # --- Long format for plotly ---
    moving_long = (
        moving_plot.reset_index()
                  .melt(id_vars="year", var_name="adjective", value_name=value_col)
    )

    # --- Plotly line chart ---
    fig = px.line(
        moving_long,
        x="year",
        y=value_col,
        color="adjective",
        markers=show_points,
        title=f"Top {min(n_plot, len(cols_to_plot))} Adjectives – {window_years}-Year Moving Average – {title_noun}",
        labels={
            "year": "Year",
            value_col: f"{value_col} (moving avg, window={window_years}y)",
            "adjective": "Adjective",
        },
    )

    # Make x axis show integer years cleanly (no scientific notation)
    fig.update_xaxes(
        tickmode="linear",
        dtick=10,          # change to 5/1 if you want denser ticks
        tickformat="d",
    )

    fig.update_layout(
        width=1000,
        height=500,
        legend_title_text="",
        hovermode="x unified",
        margin=dict(l=40, r=40, t=70, b=40),
    )

    fig.show()
    return yearly, moving


def plot_top_adjective_ranking(
    adj_df: pd.DataFrame,
    top_n: int = 20,
    noun_input=None,
    metric: str = "count",   # "count" or "share"
):
    """
    Show a simple ranking of the most frequent adjective modifiers.

    metric:
      - "count": total raw modifier counts across corpus
      - "share": percentage share among the displayed top-N modifier counts (sums to 100)
    """
    totals = (
        adj_df.groupby("adjective")["count"]
              .sum()
              .sort_values(ascending=False)
              .head(top_n)
              .reset_index()
              .rename(columns={"count": "total_count"})
    )

    if metric == "share":
        totals["value"] = (totals["total_count"] / totals["total_count"].sum()) * 100
        xcol = "value"
        xlabel = "Share among the displayed adjective modifiers (%)"
        hover = {"total_count": True, "value": ":.2f"}
    else:
        totals["value"] = totals["total_count"]
        xcol = "value"
        xlabel = "Total count (across corpus)"
        hover = {"total_count": True}

    # Make a readable title noun
    if isinstance(noun_input, str):
        title_noun = f'"{noun_input}"'
    elif isinstance(noun_input, list) and noun_input:
        noun_str = ", ".join(noun_input[:3]) + (f", … ({len(noun_input)})" if len(noun_input) > 3 else "")
        title_noun = f"[{noun_str}]"
    else:
        title_noun = ""

    fig = px.bar(
        totals[::-1],               # reverse for top at top in horizontal bar
        x=xcol,
        y="adjective",
        orientation="h",
        title=f"Top {top_n} adjective modifiers of {title_noun}",
        labels={xcol: xlabel, "adjective": "Adjective"},
        hover_data=hover,
    )

    fig.update_layout(
        height=450 + top_n * 12,  # scales nicely
        margin=dict(l=120, r=40, t=60, b=40),
        yaxis=dict(categoryorder="total ascending"),
    )

    fig.show()
    return totals


def show_top_adjectives_table(adj_df: pd.DataFrame, top_n: int = 20):
    totals = (
        adj_df.groupby("adjective")["count"]
              .sum()
              .sort_values(ascending=False)
              .head(top_n)
              .reset_index()
              .rename(columns={"count": "total_count"})
    )
    totals["rank"] = range(1, len(totals) + 1)
    totals = totals[["rank", "adjective", "total_count"]]

    fig = go.Figure(data=[go.Table(
        header=dict(values=list(totals.columns)),
        cells=dict(values=[totals[c] for c in totals.columns])
    )])
    fig.update_layout(title=f"Top {top_n} adjective modifiers (total counts)", height=400)
    fig.show()
    #return totals


def plot_top_adjectives_by_decade(adj_df: pd.DataFrame, metadata_df: pd.DataFrame, top_n: int = 8):
    # attach decade via filename/ID
    dec = metadata_df[["ID", "decade"]].rename(columns={"ID": "filename"})
    df = adj_df.merge(dec, on="filename", how="left").dropna(subset=["decade"]).copy()

    # totals per decade
    g = (df.groupby(["decade", "adjective"])["count"].sum().reset_index())
    # find global top_n adjectives (across all decades)
    top_adjs = (df.groupby("adjective")["count"].sum().sort_values(ascending=False).head(top_n).index.tolist())
    g = g[g["adjective"].isin(top_adjs)]

    fig = px.bar(
        g,
        x="decade",
        y="count",
        color="adjective",
        barmode="group",
        title=f"Top {top_n} adjective modifiers by decade (raw counts)",
        labels={"count": "Count", "decade": "Decade", "adjective": "Adjective"},
    )
    fig.update_layout(height=500, hovermode="x unified")
    fig.show()
    #return g

In [None]:
noun = "Luft"

In [None]:
adj_df, top_adjs = extract_dependent_adjective_list(annotated_docs, metadata_df, noun, top_n=10)


In [None]:
adj_df_alt, top_adjs_alt = extract_dependent_adjective_list(annotated_docs, metadata_df_alt, noun, top_n=10)

In [None]:
# Save adj_df to CSV
adj_df.to_csv("adj_df_luft.csv", index=False)
print(f"Saved adj_df to adj_df_luft.csv ({len(adj_df)} rows)")

In [None]:
# Save adj_df_alt to CSV
adj_df_alt.to_csv("adj_df_alt_luft.csv", index=False)
print(f"Saved adj_df_alt to adj_df_alt_luft.csv ({len(adj_df_alt)} rows)")

## Loading saved results (optional)

If the results have already been extracted and saved as CSV, they can be loaded here instead of running the extraction again.
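
The sketch below illustrates why the year column has to be parsed again after loading, assuming pandas is available. It uses an in-memory buffer instead of a file and invented values.

```python
import io

import pandas as pd

# Toy result table with a datetime year column (invented values)
df = pd.DataFrame({
    "adjective": ["frisch", "kalt"],
    "year": pd.to_datetime(["1820", "1850"], format="%Y"),
    "count": [3, 1],
})

buffer = io.StringIO()  # in-memory stand-in for a CSV file
df.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
# CSV stores dates as text, so the column is parsed again after loading
loaded["year"] = pd.to_datetime(loaded["year"])
print(loaded.dtypes)
```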

In [None]:
# Optional: Load previously saved results from CSV

# Load adj_df
adj_df = pd.read_csv("adj_df_luft.csv")
adj_df['year'] = pd.to_datetime(adj_df['year'])

# Load adj_df_alt
adj_df_alt = pd.read_csv("adj_df_alt_luft.csv")
adj_df_alt['year'] = pd.to_datetime(adj_df_alt['year'])

# Recreate top_adjs lists from the dataframes
top_adjs = adj_df.groupby('adjective')['count'].sum().sort_values(ascending=False).head(10).index.tolist()
top_adjs_alt = adj_df_alt.groupby('adjective')['count'].sum().sort_values(ascending=False).head(10).index.tolist()

## Analysis and visualisation

### 1. Corpus-wide frequencies of adjectival modifiers

We first look at the overall distribution of adjectival modifiers. Before analysing developments over time, it is useful to see which adjectives occur most often in the whole corpus as syntactic modifiers of the nouns under study. This aggregated view helps to identify dominant patterns of description and evaluation and serves as the starting point for the diachronic analyses that follow.
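
The aggregation behind such a ranking can be sketched with the standard library alone. The per-text counts below are invented, and the names `per_text_counts` and `ranking` are hypothetical.

```python
from collections import Counter

# Toy per-text counts of adjectives modifying the target noun (invented numbers)
per_text_counts = {
    "text_a": Counter({"frisch": 3, "kalt": 1}),
    "text_b": Counter({"frisch": 2, "rein": 2}),
}

# Sum the per-text counters into corpus-wide totals
totals = Counter()
for counter in per_text_counts.values():
    totals.update(counter)

grand_total = sum(totals.values())
ranking = [
    (adj, n, round(100 * n / grand_total, 1))  # (adjective, count, share in %)
    for adj, n in totals.most_common()
]
print(ranking)
```

Here the share is computed over all counted modifiers. If the list is cut to the top N before dividing, the shares refer to the displayed adjectives only.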

#### Sample 1:

In [None]:
totals = plot_top_adjective_ranking(adj_df, top_n=20, noun_input=noun, metric="count")

In [None]:
show_top_adjectives_table(adj_df)

#### Sample 2:

In [None]:
totals_alt = plot_top_adjective_ranking(adj_df_alt, top_n=20, noun_input=noun, metric="count")

In [None]:
show_top_adjectives_table(adj_df_alt)

### 2. Diachronic analysis of adjective-noun constructions

In the next step, the analysis is extended by a diachronic perspective. Instead of looking only at corpus-wide totals, we now examine how the use of the previously identified adjectival modifiers develops over time. Aggregating by time makes it possible to trace shifts in patterns of description and evaluation and to relate them to historical processes.
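
The time-based aggregation can be sketched on a toy table, assuming pandas is available. The numbers are invented and the window of 3 is chosen only to keep the example small.

```python
import pandas as pd

# Toy per-text results (invented numbers): adjective count and noun count per text
df = pd.DataFrame({
    "year": [1820, 1820, 1830, 1840],
    "adjective": ["frisch"] * 4,
    "count": [2, 0, 1, 3],
    "noun_count": [10, 10, 10, 10],
})

# Relative frequency per 100 occurrences of the target noun
df["rel_freq"] = df["count"] / df["noun_count"] * 100

# Mean across texts of the same year, one column per adjective
yearly = df.groupby(["year", "adjective"])["rel_freq"].mean().unstack("adjective")

# Centered moving average over three consecutive rows of the yearly table
moving = yearly.rolling(window=3, center=True, min_periods=1).mean()
print(moving)
```

An integer `window` counts rows of the yearly table, so it spans exactly that many calendar years only if every year is present in the data.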

#### Sample 1:

In [None]:
# yearly lineplots + moving average
yearly_1, moving_1 = plot_adjective_trends_moving_avg_plotly(
    adj_df, top_adjs, noun_input=noun, window_years=10, n_plot=8
)

In [None]:
plot_top_adjectives_by_decade(adj_df, metadata_df, top_n=8)

In [None]:
# sample 1
plot_adjective_trends_list(adj_df, top_adjs, noun)

#### Sample 2:

In [None]:
# yearly lineplots + moving average
yearly_2, moving_2 = plot_adjective_trends_moving_avg_plotly(
    adj_df_alt, top_adjs_alt, noun_input=noun, window_years=10, n_plot=8
)

In [None]:
plot_top_adjectives_by_decade(adj_df_alt, metadata_df_alt, top_n=8)

In [None]:
# sample 2
plot_adjective_trends_list(adj_df_alt, top_adjs_alt, noun)

## Add plots for specific words that we found

TO BE CHANGED / UPDATED / CONTINUED