## Modeling a non-negative matrix factorization topic model to identify and analyze the historical evolution of anthropogenic climate change impacts on water resources. 

### by Tanay Tunçer

<br>
<br>

### Table of Contents

1 Data <br>
2 Data pre-processing <br>
3 Exploratory Data Analysis <br>
4 Model parameter <br>
5 Topic Coherence Cv <br>
6 Latent Dirichlet Allocation <br>
7 Non-negative Matrix Factorization 
<br>
<br>

### Introduction 

Anthropogenic climate change is one of the greatest challenges of the 21st century, and research seeks to predict which economic, socio-economic and environmental changes can be expected in the future. Its cause can be traced back to pre-industrial times: it began with the first artificial emission of greenhouse gases into the atmosphere, which had a direct impact on the climate system. According to the Intergovernmental Panel on Climate Change, limiting warming to 1.5 °C or 2.0 °C is not achievable unless greenhouse gas emissions are reduced immediately and on a large scale. Climate change triggers different changes in different regions, and these will intensify as greenhouse gas emissions continue.

Climate change is associated with changes in the water cycle: increasing water vapor content of the atmosphere; changing precipitation patterns, intensity and extremes; rising sea levels; decreasing snow and ice cover in mountains and water basins; flooding and drought; and changes in soil moisture. It is therefore important to improve knowledge of the effects of climate change on water as a resource.

In this example, the practical suitability of the algorithm is assessed by identifying the most influential impacts of climate change on water as a resource. The implemented models should identify descriptive words that summarize the identified concepts as concisely as possible. The analysis is based on a collection of scientific articles with the highest consensus in the topic area, in order to capture the most important topics of that area. The respective background topics of the research area are therefore not taken into account.

<br>
<br>

In [None]:
import pandas as pd 
import numpy as np
import re
import os

import nltk
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet

import plotly.express as px

from scipy import stats
import pingouin as pg

from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from collections import Counter

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

pd.set_option("display.max_columns", None)
pd.get_option("display.max_rows")

%load_ext jupyternotify
from tqdm import tqdm, tqdm_notebook


## 1. Data

In [None]:
# Change the file path. Note that the file is read automatically from this path.
path = "C:/Users/tanaytuncer/Desktop/507145_Tanay_Tuncer_Bachelor/data"

# Print every file found in the data directory.
for document in os.listdir(path):
    document_dir = (path + "/" + document)
    print(document_dir)

# Load the last file listed above as the raw data set.
text = pd.read_csv(document_dir, delimiter = ",")
text

In [None]:
text.info()

In [None]:
text.isnull().sum()

In [None]:
text["Cited by"] = text["Cited by"].fillna(0)

text.drop(text[text["Abstract"]=="[No abstract available]"].index, inplace=True)
text = text[text["Cited by"] >= 150]

text.head()

## 2. Data pre-processing

In [None]:
corpus = text.copy()

# Build the stop-word set once instead of reloading the list for every token.
STOP_WORDS = set(stopwords.words("english"))

def remove_stopwords(tokens):
    text = [x for x in tokens if x not in STOP_WORDS]
    return text

def remove_punctuation(tokens):
    import string
    # Drop punctuation tokens first, then digit tokens, and join the remainder.
    text = [x for x in tokens if x not in string.punctuation]
    text = " ".join([x for x in text if x not in string.digits])
    return text

def remove_words(tokens):
    words = ["δ18o", "paper"]
    text = [x for x in tokens if x not in words]
    text = [x for x in text if len(x) >= 3]
    #text = [x for x in text if x != ","]
    return text

    
def get_pos_tag(pos_tag):
    if pos_tag in ["JJ", "JJR", "JJS"]:
        return wordnet.ADJ
    elif pos_tag in ["RB", "RBR", "RBS"]:
        return wordnet.ADV
    elif pos_tag in ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]:
        return wordnet.VERB
    else: 
        return wordnet.NOUN

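def lemmatize_words(tokens):
    # The lemmatizer's default part of speech (noun) is used. This matches the
    # pipeline below, where extract_noun runs before lemmatization.
    lemmatizer = WordNetLemmatizer()
    words = " ".join([lemmatizer.lemmatize(i) for i in tokens])
    return words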

def extract_noun (tokens):
    pos_tags = nltk.pos_tag(tokens)
    tags = ["NN", "NNS", "NNP", "NNPS"]
    words = [i[0] for i in pos_tags if i[1] in tags]
    return words
                      

In [None]:
%%notify
# Run each pre-processing step exactly once. Repeating the pipeline would re-tag
# and re-filter text that has already been reduced to lemmatized nouns.
corpus["Abstract"] = corpus["Abstract"].map(lambda x: x.lower())
corpus["Abstract"] = corpus["Abstract"].map(lambda x: RegexpTokenizer(r"\w+").tokenize(x))
corpus["Abstract"] = corpus["Abstract"].apply(lambda x: remove_words(x))
corpus["Abstract"] = corpus["Abstract"].map(lambda x: remove_stopwords(x))
corpus["Abstract"] = corpus["Abstract"].apply(lambda x: extract_noun(x))
corpus["Abstract"] = corpus["Abstract"].apply(lambda x: lemmatize_words(x))
print("Data pre-processed.")

In [None]:
# Use positional access: rows were dropped above, so index label 0 may no longer exist.
print("Original data: " + text["Abstract"].iloc[0])
print("")
print("Pre-processed data: " + corpus["Abstract"].iloc[0])
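
The effect of the first pre-processing steps can be illustrated on a single sentence. The following is a minimal sketch using only the standard library: the stop-word set is a small hypothetical stand-in for NLTK's English list, and the noun extraction and lemmatization steps are left out because they require the NLTK models.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's full English list.
STOP_WORDS_DEMO = {"the", "and", "are", "for", "with"}

def preprocess_demo(abstract, min_len=3):
    """Lowercase, tokenize on word characters, drop short tokens and stop words."""
    tokens = re.findall(r"\w+", abstract.lower())
    tokens = [t for t in tokens if len(t) >= min_len]
    return [t for t in tokens if t not in STOP_WORDS_DEMO]

sample = "The impacts of climate change on water resources are increasing."
demo_tokens = preprocess_demo(sample)
print(demo_tokens)
```

Two-letter words such as "of" and "on" are removed by the length filter alone, which mirrors the `len(x) >= 3` condition in `remove_words`.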

## 3. Exploratory Data Analysis

In [None]:
colors1 = ["#023373"]
colors2 = ["#023373", "#979797"]

In [None]:
def line(df, x, y, z):
    """
    Plot line chart.
    
    df = dataframe 
    x = categorical variable
    y = numerical variable 
    z = color_code
    
    """
    fig = px.line(
        data_frame = df,
        x = x, 
        y = y,
        color = z,
        color_discrete_sequence= colors2,
        template="simple_white",
        width=1000,        
        height=400,
        log_x = False
    )
        
    return fig.show()

In [None]:
def bar_chart(df, x, y):
    """
    Plot bar chart.
    
    df = dataframe
    x = categorical variable
    y = numerical variable 
    
    """
    fig = px.bar(
        data_frame = df,
        x = x,
        y = y,
        color_discrete_sequence=["#03658C"],
        template="simple_white",
        orientation = "v",
        text = y,
        facet_col = "Period",
        facet_col_wrap=2,
        facet_row_spacing = 0.15,        
        width=1000,        
        height=600
    )
    
    fig.update_layout(font=dict(family="Times New Roman",size=12),
                      xaxis={"categoryorder":"category ascending"}
                     )
    
    fig.for_each_annotation(lambda x: x.update(text = x.text.replace("Period=", "")))
    fig.update_yaxes(title_text = "", visible = False)
    fig.update_xaxes(title_text = "Topic")
    
    
    return fig.show()

In [None]:
def barh_chart(df, x, y, title):
    """
    Plot bar chart.
    
    df = dataframe
    x = categorical variable
    y = numerical variable 
    
    """
    fig = px.bar(
        data_frame = df,
        x = x,
        y = y,
        color_discrete_sequence=["#03658C"],
        template="simple_white",
        orientation = "h",
        text = x,
        width=1000,        
        height=800
    )
    
    fig.update_layout(font=dict(family="Times New Roman",size=12),
                      yaxis={'categoryorder':'max ascending'},
                      title={
                          'text': title,
                          'y':0.98,
                          'x':0.5,
                          'xanchor': 'center',
                          'yanchor': 'top'}
                     )
    
    fig.update_yaxes(title_text = "")
    fig.update_xaxes(title_text = "", visible = False)
    
    
    return fig.show()

In [None]:
def histogram(df, x, n_bins, title): 
    """
    Plot histogram.
    
    df = dataframe
    x = categorical variable
    n_bins = bin size 
    
    """
    fig = px.histogram(
        data_frame=df,
        x = x,
        marginal= "box",
        color_discrete_sequence=["#03658C"],
        nbins=n_bins,
        template="simple_white",
        width=1000,        
        height=500
    )
    
    fig.update_layout(
        font_family="Times New Roman", 
        title={
            'text': title,
            'y':0.95,
            'x':0.5,
            'xanchor': 'center',
            'yanchor': 'top'}
    )

    
    return fig.show()

In [None]:
def scatter_plot(df):
    """
    Plot scatter plot.
    
    df = dataframe with x and y coordinates 
    
    """
    fig = px.scatter(
        data_frame = df,
        #color = topics,
        opacity = 0.5,
        title = None,
        color_discrete_sequence=["#023373", "#03588C", "#03658C", "#6CBAD9", "#F2F2F2" ],
        template="simple_white",
        orientation = "h",
        width=1000,        
        height=600
    )
    
    fig.update_layout(
        font_family="Times New Roman" 
    )
    
    return fig.show() 

In [None]:
def area(df, x, y, z):  
    """
    Plot area chart.
    
    df = dataframe 
    x = categorical variable
    y = numerical variable 
    z = topics
    
    """
    fig = px.area(
        data_frame = df,
        x = x,
        y = y,
        template = "simple_white",
        color_discrete_sequence = colors1,
        facet_col = z,
        facet_col_wrap=2,
        facet_row_spacing = 0.1,
        height = 1000,
        width = 1000
    )
    
    fig.update_layout(
        font_family="Times New Roman",
        showlegend=False
    )
    
    fig.for_each_annotation(lambda x: x.update(text = x.text.replace("variable=", "")))
    
    fig.update_xaxes(title_text = "")
    fig.update_yaxes(title_text = "")
    
    return fig.show()

In [None]:
def plot_top_words(model, feature_names, n_top_words):
    
    for topic_idx, topic in enumerate(model.components_):
        top_features_ind = topic.argsort()[: -n_top_words - 1 : -1]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = topic[top_features_ind]

        fig = px.bar(
            x = weights,
            y = top_features,
            text =np.round(weights, 2),
            orientation = "h",
            color_discrete_sequence=["#03658C"],
            template = "simple_white", 
            #title="Long-Form Input",
            title = "Topic " + str(topic_idx),
            height = 400,
            width = 400

        )



        fig.update_layout(
            font_family="Times New Roman",
            showlegend=False,
            yaxis={'categoryorder':'max ascending'},
            title={
                'y':0.9,
                'x':0.5,
                'xanchor': 'center',
                'yanchor': 'top'}
        )
        
        fig.for_each_annotation(lambda x: x.update(text = x.text.replace("topic=", " ")))
        fig.update_yaxes(title_text = "")
        fig.update_xaxes(title_text = "", visible = False)
        
        fig.show()
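
The slice `topic.argsort()[: -n_top_words - 1 : -1]` in `plot_top_words` picks the indices of the largest weights in descending order. A plain-Python sketch of the same selection is shown here; the vocabulary and weights are made up for illustration, and the order of tied weights may differ from NumPy's.

```python
def top_terms(weights, feature_names, n_top_words):
    """Return the n highest-weighted terms of one topic, largest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [(feature_names[i], weights[i]) for i in order[:n_top_words]]

# Hypothetical topic-term weights for a five-word vocabulary.
demo_features = ["basin", "drought", "flood", "glacier", "runoff"]
demo_topic = [0.10, 0.90, 0.40, 0.05, 0.70]
print(top_terms(demo_topic, demo_features, 3))
```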


In [None]:
def fig_2d(df, x, y, document, color, topic_order):
    fig = px.scatter(
        df, 
        x=x, 
        y=y,
        color = color, labels={"topic": "Topics"},
        hover_name = document,
        opacity = 0.85,
        template="simple_white",
        width=800,        
        height=500,
        color_discrete_sequence=["red", "green", "blue", "orange", "goldenrod", "magenta"],
        category_orders= topic_order
    )
    
    fig.update_layout(showlegend=True)


    fig.update_xaxes(visible = False)
    fig.update_yaxes(visible = False)

    return fig.show()       

In [None]:
def fig_3d(df, x, y, z, color):
    fig = px.scatter_3d(
        df, 
        x=x, 
        y=y,
        z=z,
        color = color,
        #hover_name = "Topic",
        opacity = 0.85,
        template="simple_white",
        width=1000,        
        height=800,
        color_continuous_scale = colors2
    )
    
    fig.update_layout(showlegend=False)


    fig.update_xaxes(visible = False)
    fig.update_yaxes(visible = False)

    return fig.show()    

In [None]:
def tm_dist_period(df):
    df = df.copy()
    df = df.groupby(["Year", "Topic"]).size()
    df = pd.DataFrame(df).reset_index().rename(columns={0:"Count"})

    df["Year"] = df["Year"].astype(int).astype(str)
    df["Topic"] = df["Topic"].astype(str)

    conditions = [
        (df["Year"] <= "1997"),
        (df["Year"] >= "1998") & (df["Year"] <= "2005"),
        (df["Year"] >= "2006") & (df["Year"] <= "2013"),
        (df["Year"] >= "2014")
    ]

    titles = ["a) between 1977 and 1997",
              "b) between 1998 and 2005", 
              "c) between 2006 and 2013",
              "d) between 2014 and 2020"]

    df["Period"] = np.select(conditions, titles)
    df = df.groupby(by = ["Period", "Topic"]).agg(Count = ("Count", np.sum)).reset_index()
    
    fig = bar_chart(df, df["Topic"].astype("int64"), df["Count"])
    
    return fig
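
`tm_dist_period` assigns each publication year to one of four periods by comparing year strings. As an alternative sketch, the same binning can be written with integer years and `bisect`; the period boundaries are taken from the function above, while the short labels and the helper name `assign_period` are illustrative.

```python
from bisect import bisect_left

# Upper bounds (inclusive) of the first three periods; later years fall into the last one.
UPPER_BOUNDS = [1997, 2005, 2013]
PERIOD_LABELS = ["1977-1997", "1998-2005", "2006-2013", "2014-2020"]

def assign_period(year):
    """Map a publication year to its period label."""
    return PERIOD_LABELS[bisect_left(UPPER_BOUNDS, int(year))]

for y in (1990, 1998, 2013, 2019):
    print(y, assign_period(y))
```

Comparing integers avoids relying on the lexicographic order of year strings, which only works as long as every year has four digits.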

In [None]:
def topic_distribution(df):
    topic_count = pd.DataFrame()

    topic_count["Topic"] = df.Topic
    topic_count = topic_count.groupby("Topic").agg(
        Total_Documents = ("Topic", np.size),
        Proportion = ("Topic", np.size))

    topic_count["Proportion"] = topic_count["Proportion"].apply(lambda x: round((x * 100) / len(corpus), 2))

    return topic_count.reset_index()
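
`topic_distribution` counts how many documents have each topic as their dominant topic and expresses the count as a percentage. The idea is sketched here in plain Python; the document-topic weights are invented for illustration.

```python
from collections import Counter

# Hypothetical document-topic weights: one row per document, one column per topic.
demo_doc_topic = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]

# Dominant topic = index of the largest weight in each row (argmax).
dominant = [max(range(len(row)), key=row.__getitem__) for row in demo_doc_topic]

# Share of documents per dominant topic, in percent.
topic_counts = Counter(dominant)
shares = {t: round(100 * c / len(demo_doc_topic), 2) for t, c in sorted(topic_counts.items())}
print(dominant, shares)
```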

In [None]:
histogram(text, "Cited by", 150, "Histogram of citation counts")

In [None]:
top_journals = (text.groupby(by = ["Source title"]).size().reset_index(name="Total Documents"))
top_journals = top_journals.sort_values("Total Documents", ascending=False)

top_journals = top_journals[:15]
barh_chart(top_journals, y = "Source title", x = "Total Documents", title = "Top 15 Journals")

In [None]:
histogram(text, "Year", (len(text["Year"].unique())), "Histogram of publication years")

In [None]:
corpus["word_count"] = corpus["Abstract"].apply(lambda x: len(str(x).split()))
histogram(corpus, "word_count", 30, "Histogram of the number of words per abstract")

In [None]:
top_words = pd.Series(' '.join(corpus["Abstract"]).split()).value_counts()[:30]
# Name the counts explicitly so the column exists regardless of the pandas version.
top_words = top_words.rename("count").to_frame()

barh_chart(top_words, "count", top_words.index, "Top 30 words in the corpus")

In [None]:
def get_top_n_gram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 3)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    
    sum_words = bag_of_words.sum(axis=0).round(2)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

top_words = get_top_n_gram(corpus["Abstract"], 30)
df3 = pd.DataFrame(top_words, columns = ["n_gram", "count"])

barh_chart(df3, "count", "n_gram", "Top 30 n-grams in the corpus")
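
`get_top_n_gram` counts bigrams and trigrams with `CountVectorizer(ngram_range=(2, 3))`. The counting itself can be sketched with `collections.Counter`. This sketch assumes whitespace tokenization, which is simpler than the vectorizer's default token pattern, and uses two made-up documents.

```python
from collections import Counter

def count_ngrams(documents, n_min=2, n_max=3):
    """Count all word n-grams with n_min <= n <= n_max across the documents."""
    counts = Counter()
    for doc in documents:
        tokens = doc.split()
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Hypothetical pre-processed abstracts.
demo_docs = ["climate change impact water", "climate change water resource"]
ngram_counts = count_ngrams(demo_docs)
print(ngram_counts.most_common(1))
```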

In [None]:
def topic_popularity(model, df):
    # Select the per-document topic weights by column name ("Topic 0", "Topic 1", ...)
    # rather than by position, so that text columns are never summed.
    topic_cols = [c for c in df.columns if str(c).startswith("Topic ")]
    topic_distribution = df[topic_cols].copy()
    topic_distribution["Year"] = df["Year"]

    # Sum the topic weights per year, then scale each topic so that its
    # weights over all years add up to 1.
    topic_distribution_by_year = topic_distribution.groupby(by = "Year").sum()
    topic_distribution_by_year = topic_distribution_by_year / topic_distribution_by_year.sum()
    topic_distribution_by_year = topic_distribution_by_year.reset_index().melt(id_vars = "Year")
    
    return topic_distribution_by_year

## 4. Model parameter

In [None]:
n_features = None
n_top_words = 10

k_min = 3
k_max = 30 + 1
k_step = 1

max_df = 0.95
min_df = 2


## 5. Calculate Topic Coherence Cv

In [None]:
%%notify
def build_gensim_data(corpus):
    from gensim.corpora.dictionary import Dictionary

    # Tokenize the pre-processed abstracts and build the gensim dictionary.
    gensim_data = [i.split() for i in corpus]
    gensim_dict = Dictionary(gensim_data)

    gensim_dict.filter_extremes(no_below=3, no_above=0.95, keep_n=5000)

    gensim_corpus = [gensim_dict.doc2bow(text) for text in gensim_data]

    return gensim_data, gensim_dict, gensim_corpus

# The function has its own name so that its result does not overwrite it.
gensim_data, gensim_dict, gensim_corpus = build_gensim_data(corpus["Abstract"])


In [None]:
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.nmf import Nmf
from operator import itemgetter

coherence = []
n_topic = list(np.arange(k_min, k_max, k_step))
for i in n_topic:
    gensim_nmfModel = Nmf(
        corpus=gensim_corpus,
        num_topics=i,
        id2word=gensim_dict,
        chunksize=2000,
        passes=5,
        kappa=.1,
        minimum_probability=0.01,
        w_max_iter=300,
        w_stop_condition=0.0001,
        h_max_iter=100,
        h_stop_condition=0.001,
        eval_every=10,
        normalize=True,
        random_state=1
    )
  
    cm = CoherenceModel(
        model=gensim_nmfModel,
        texts=gensim_data,
        dictionary=gensim_dict,
        coherence= "c_v"
    )
  
    coherence.append((cm.get_coherence(), i, "NMF"))

In [None]:
nmf_coherence_values = pd.DataFrame(coherence, columns = ["Kohärenzwert Cv", "Anzahl der Themen", "Algorithmus"])
plot_coherence = line(nmf_coherence_values, "Anzahl der Themen", "Kohärenzwert Cv", None)
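
Besides reading the best number of topics off the line chart, it can be taken directly from the `coherence` list, whose entries are `(score, number of topics, algorithm)` tuples. A minimal sketch follows; the scores are made-up placeholders, not results from this corpus.

```python
from operator import itemgetter

# Hypothetical entries in the same (coherence, n_topics, algorithm) layout as above.
demo_coherence = [
    (0.41, 3, "NMF"),
    (0.47, 4, "NMF"),
    (0.52, 6, "NMF"),
    (0.49, 8, "NMF"),
]

# Pick the tuple with the highest coherence score (first element).
best_score, best_k, algorithm = max(demo_coherence, key=itemgetter(0))
print(f"{algorithm}: best Cv = {best_score} at k = {best_k}")
```

A single maximum can be sensitive to the random seed, so it is usually read together with the shape of the curve.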

In [None]:
%%notify
from gensim.models.ldamodel import LdaModel

coherence = []
n_topic = list(np.arange(k_min, k_max, k_step))
for i in n_topic:
    gensim_Model = LdaModel(
        corpus=gensim_corpus,
        id2word=gensim_dict,
        num_topics=i,
        eta = 0.1,
        alpha = 50/i,
        random_state=1,
        chunksize=200,
        passes=1000,
        per_word_topics=True
    )

    cm = CoherenceModel(
        model=gensim_Model,
        texts=gensim_data,
        dictionary=gensim_dict,
        coherence= "c_v"
    )
    
    coherence.append((cm.get_coherence(), i, "LDA"))

In [None]:
lda_coherence_values = pd.DataFrame(coherence, columns = ["Kohärenzwert Cv", "Anzahl der Themen", "Algorithmus"])
plot_coherence = line(lda_coherence_values, "Anzahl der Themen", "Kohärenzwert Cv", None)

In [None]:
coherence_values = pd.concat([nmf_coherence_values, lda_coherence_values], join="inner")
line(coherence_values, "Anzahl der Themen", "Kohärenzwert Cv", "Algorithmus")

In [None]:
coherence_values.drop("Anzahl der Themen", axis = 1).groupby("Algorithmus").describe()

In [None]:
print(stats.shapiro(nmf_coherence_values["Kohärenzwert Cv"]))
print(stats.shapiro(lda_coherence_values["Kohärenzwert Cv"]))

In [None]:
stats.levene(nmf_coherence_values["Kohärenzwert Cv"], lda_coherence_values["Kohärenzwert Cv"])

In [None]:
res = pg.ttest(nmf_coherence_values["Kohärenzwert Cv"], lda_coherence_values["Kohärenzwert Cv"], alternative = "greater", confidence = 0.99, correction=True)
display(res)
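
With `correction=True`, the unpaired t-test above uses Welch's variant, which does not assume equal variances. The statistic and its degrees of freedom are

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}, \qquad \nu = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}.$$

A standard-library sketch of this computation is given here. The two samples are invented placeholders, and the p-value is omitted because it requires the t-distribution.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # sample variance of a divided by its size
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

# Hypothetical coherence scores for two algorithms.
demo_nmf = [0.50, 0.52, 0.54, 0.56, 0.58]
demo_lda = [0.40, 0.44, 0.48, 0.52, 0.56]
t_stat, dof = welch_t(demo_nmf, demo_lda)
print(round(t_stat, 3), round(dof, 3))
```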

## 6. Latent Dirichlet Allocation 

In [None]:
count_vectorizer = CountVectorizer(max_df=max_df, 
                                   min_df=min_df, 
                                   analyzer='word',
                                   ngram_range=(1,2), 
                                   max_features=n_features
                                  )


In [None]:
bow_dtm = count_vectorizer.fit_transform(corpus["Abstract"])


In [None]:
lda_tm = LatentDirichletAllocation(
    n_components=6,
    doc_topic_prior = (50/6),
    topic_word_prior = 0.1,
    max_iter=1000, 
    learning_method='online',   
    random_state=1,
    batch_size=128, 
    evaluate_every = -1,
    n_jobs = -1
)


In [None]:
lda_output = lda_tm.fit_transform(bow_dtm)

In [None]:
lda_df = pd.DataFrame(
    {"Topic" : np.argmax(lda_output, axis = 1), 
     "Terms" : corpus["Abstract"], 
     "Keywords" : corpus["Index Keywords"], 
     "Title" : corpus["Title"], 
     "DOI" : corpus["DOI"], 
     "Year" : corpus["Year"], 
     "Journal" : corpus["Source title"]}
)


In [None]:
lda_df = lda_df.reset_index()

topic_names = ["Topic " + str(i) for i in range(lda_tm.n_components)]

df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns = topic_names)
lda_df = lda_df.join(df_document_topic)

lda_df


In [None]:
topic_distribution(lda_df)

In [None]:
feature_names = count_vectorizer.get_feature_names()

plot_top_words(lda_tm, feature_names, 10)

In [None]:
area(topic_popularity(lda_tm, lda_df), "Year", "value", "variable")

In [None]:
tm_dist_period(lda_df)


## 7. Non-negative Matrix Factorization Topic Model

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_df = max_df, 
                                   min_df = min_df,
                                   max_features = n_features, 
                                   ngram_range = (1,2)
                                  )


In [None]:
tfidf_dtm = tfidf_vectorizer.fit_transform(corpus["Abstract"])
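
Unlike the raw counts used for LDA, the NMF model is fitted on TF-IDF weights. With its default settings, `TfidfVectorizer` multiplies each term count by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and scales every document vector to unit Euclidean length. A plain-Python sketch of that weighting is shown here. It assumes whitespace tokenization and non-empty documents, and it ignores the `max_df`, `min_df` and n-gram settings used above.

```python
from collections import Counter
from math import log, sqrt

def tfidf_demo(documents):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights = {t: c * (log((1 + n_docs) / (1 + doc_freq[t])) + 1)
                   for t, c in term_freq.items()}
        norm = sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Hypothetical pre-processed abstracts.
tfidf_rows = tfidf_demo(["water climate", "water drought"])
print(tfidf_rows[0])
```

The term "water" appears in both documents and therefore receives a lower weight than "climate", which appears in only one.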

In [None]:
nmf_tm = NMF(n_components = 6, 
             random_state=1,
             alpha=.1,
             solver = "mu",
             max_iter = 1000,
             beta_loss="frobenius",
             l1_ratio=.5
            )


In [None]:
nmf_tm = nmf_tm.fit(tfidf_dtm)

In [None]:
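V = tfidf_vectorizer.transform(corpus["Abstract"])
# Naming in this notebook: W holds the topic-term weights (components_) and
# H the document-topic weights. scikit-learn's documentation labels them the other way round.
W = nmf_tm.components_
# fit_transform refits the model on V; the fixed random_state keeps the fit reproducible.
H = nmf_tm.fit_transform(V) 


print("Non-negative Matrix Factorization model")
print('V = {} x {}'.format(V.shape[0], V.shape[1]))
print('W = {} x {}'.format(W.shape[0], W.shape[1]))
print('H = {} x {}'.format(H.shape[0], H.shape[1]))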
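
The model above is fitted with `solver="mu"` and `beta_loss="frobenius"`. To show what a multiplicative-update factorization does, the following sketch implements the classic update rules for the Frobenius loss in plain Python, without regularization and with a simple random initialization. It is not scikit-learn's implementation. The names `doc_topic` and `topic_term` are used instead of W and H to avoid the labeling ambiguity, and the input matrix is a made-up example.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def frobenius_error(V, doc_topic, topic_term):
    """Frobenius norm of V minus its reconstruction doc_topic @ topic_term."""
    R = matmul(doc_topic, topic_term)
    return sum((v - r) ** 2 for rv, rr in zip(V, R) for v, r in zip(rv, rr)) ** 0.5

def nmf_mu(V, k, n_iter=200, seed=1, eps=1e-9):
    """Factorize V (docs x terms) into non-negative doc_topic (docs x k) and topic_term (k x terms)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    A = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    B = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # topic_term <- topic_term * (A^T V) / (A^T A B)
        At = transpose(A)
        num, den = matmul(At, V), matmul(matmul(At, A), B)
        B = [[b * p / (q + eps) for b, p, q in zip(rb, rp, rq)]
             for rb, rp, rq in zip(B, num, den)]
        # doc_topic <- doc_topic * (V B^T) / (A B B^T)
        Bt = transpose(B)
        num, den = matmul(V, Bt), matmul(A, matmul(B, Bt))
        A = [[a * p / (q + eps) for a, p, q in zip(ra, rp, rq)]
             for ra, rp, rq in zip(A, num, den)]
    return A, B

# Hypothetical 4 documents x 4 terms matrix with two blocks of co-occurring terms.
V_demo = [[1, 1, 0, 0], [2, 2, 0, 0], [0, 0, 1, 1], [0, 0, 3, 3]]
doc_topic, topic_term = nmf_mu(V_demo, k=2)
print(round(frobenius_error(V_demo, doc_topic, topic_term), 4))
```

Because every update multiplies a non-negative entry by a non-negative ratio, both factors stay non-negative throughout, which is what makes the resulting topics interpretable as additive parts.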


In [None]:
nmf_df = pd.DataFrame(
    {"Topic" : np.argmax(H, axis = 1), 
     "Terms" : corpus["Abstract"], 
     "Keywords" : corpus["Author Keywords"], 
     "Title" : corpus["Title"], 
     "DOI" : corpus["DOI"], 
     "Year" : corpus["Year"], 
     "Journal" : corpus["Source title"]})


In [None]:
nmf_df = nmf_df.reset_index()

topic_names = ["Topic " + str(i) for i in range(nmf_tm.n_components)]

df_document_topic = pd.DataFrame(np.round(H, 2), columns = topic_names)

nmf_df = nmf_df.join(df_document_topic)
nmf_df


In [None]:
topic_distribution(nmf_df)

In [None]:
feature_names = tfidf_vectorizer.get_feature_names()

plot_top_words(nmf_tm, feature_names, 10)


In [None]:
area(topic_popularity(nmf_tm, nmf_df), "Year", "value", "variable")


In [None]:
tm_dist_period(nmf_df)
