## Literature analysis with non-negative matrix factorization topic model 

Project: Scientific Anaylsis <br>
Documentation: Master Thesis <br>
Author: Tanyel Tunçer <br>
Supervisor: <br> 
Selected Database: <br> <br>
Date of analysis: 
<br>


### Data Preparation 

Step 1: Define a search string <br>
Step 2: Query scientific data based on the defined search string  <br>
Step 3: Export results (max: 2000 entries in Scopus) <br>
Step 4: Import data from local to model
<br>

### Analysis Instructions

Step 1: Import data <br>
Step 2: Pre-process imported data <br>
Step 3: Exploratory data analysis <br>
Step 4: Define model parameter <br>
Step 5: Calculate topic coherence Cv <br>
Step 7: Execute non-negative matrix factorization algoritmn
<br>

### Technical Preparation for model execution 

Install following version before importing libaries <br>
Note: All results from the thesis are identified based on the listed libary versions, other version can result in different result <br> <br>
pandas: 1.2.4  <br>
numpy: 1.19.2 <br>
nltk: 3.7 <br>
plotly: 5.8.0 

In [1]:
import pandas as pd 
import numpy as np
import re
import os

import nltk
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet

import plotly.express as px

from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from collections import Counter

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

pd.set_option("display.max_columns", None)
pd.get_option("display.max_rows")


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\tanye\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tanye\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\tanye\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\tanye\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


60

## 1. Data

In [2]:
import os

# get current working directory
cwd = os.getcwd()

#get files in directory
files = os.listdir(cwd) 

print(files)

['.ipynb_checkpoints', 'literature_analysis_tanyel_tuncer.ipynb', 'research.xlsx', 'scopus.csv']


In [3]:
#Change data path. Data will be automatically imported from the given data path. 
text = pd.read_csv("C:/Users/tanye/Desktop/Literature Review/Literature/scopus.csv", sep = ",")
text.head()

Unnamed: 0,Authors,Author full names,Author(s) ID,Titles,Year,Source title,DOI,Cited by,Link,Abstract,Author Keywords,Document Type,Source
0,Petrakis E.; Vlassis M.,"Petrakis, Emmanuel (57208174300); Vlassis, Min...",57208174300; 56001698900,Endogenous scope of bargaining in a union-olig...,2000,Labour Economics,10.1016/S0927-5371(99)00043-3,40,https://www.scopus.com/inward/record.uri?eid=2...,The scope of firm-union bargaining is shown to...,Decentralized bargaining; J50; J53; L13; Oligo...,Article,Scopus
1,Cristóbal J.R.S.,"Cristóbal, José Ramón San (55262675100)",55262675100,The use of game theory to solve conflicts in t...,2015,International Journal of Information Systems a...,10.12821/ijispm030203,23,https://www.scopus.com/inward/record.uri?eid=2...,A typical construction project involves a wide...,Conflict; Game Theory; Negotiation; Project ma...,Article,Scopus
2,Sætersdal H.; Johannessen J.-A.,"Sætersdal, Helene (57188838425); Johannessen, ...",57188838425; 7101812429,The future of HR: Understanding knowledge mana...,2019,The Future of HR: Understanding Knowledge Mana...,,0,https://www.scopus.com/inward/record.uri?eid=2...,HR departments are in transition. From 1980 to...,,Book,Scopus
3,Chong V.K.; Loy C.Y.; Wang I.Z.; Woodliff D.R.,"Chong, Vincent K. (7004645694); Loy, Chanel Y....",7004645694; 56736045500; 57199509029; 6506920897,"The effect of negotiators’ role, leadership to...",2021,Journal of Management Control,10.1007/s00187-021-00321-8,0,https://www.scopus.com/inward/record.uri?eid=2...,This study examines the effect of negotiators’...,And negotiated transfer prices; Leadership ton...,Article,Scopus
4,Yeboah-Assiamah E.; Gyekye-Jandoh M.A.A.; Asam...,"Yeboah-Assiamah, Emmanuel (56527124800); Gyeky...",56527124800; 56690387200; 56950997000; 2346847...,"Henceforth, We Will Never Walk Alone: Empirica...",2022,Systemic Practice and Action Research,10.1007/s11213-021-09572-x,1,https://www.scopus.com/inward/record.uri?eid=2...,Moving beyond the ‘great man’ view of the orga...,Cases; Flexibility; Leadership; Networks; Tran...,Article,Scopus


In [4]:
text.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 476 entries, 0 to 475
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Authors            470 non-null    object
 1   Author full names  470 non-null    object
 2   Author(s) ID       470 non-null    object
 3   Titles             476 non-null    object
 4   Year               476 non-null    int64 
 5   Source title       476 non-null    object
 6   DOI                444 non-null    object
 7   Cited by           476 non-null    int64 
 8   Link               476 non-null    object
 9   Abstract           476 non-null    object
 10  Author Keywords    290 non-null    object
 11  Document Type      476 non-null    object
 12  Source             476 non-null    object
dtypes: int64(2), object(11)
memory usage: 48.5+ KB


In [5]:
text.isnull().sum()

Authors                6
Author full names      6
Author(s) ID           6
Titles                 0
Year                   0
Source title           0
DOI                   32
Cited by               0
Link                   0
Abstract               0
Author Keywords      186
Document Type          0
Source                 0
dtype: int64

In [6]:
text["Cited by"] = text["Cited by"].fillna(0)

text.drop(text[text["Abstract"]=="[No abstract available]"].index, inplace=True)
#text = text[text["Cited by"] >= 150]

text.head()

Unnamed: 0,Authors,Author full names,Author(s) ID,Titles,Year,Source title,DOI,Cited by,Link,Abstract,Author Keywords,Document Type,Source
0,Petrakis E.; Vlassis M.,"Petrakis, Emmanuel (57208174300); Vlassis, Min...",57208174300; 56001698900,Endogenous scope of bargaining in a union-olig...,2000,Labour Economics,10.1016/S0927-5371(99)00043-3,40,https://www.scopus.com/inward/record.uri?eid=2...,The scope of firm-union bargaining is shown to...,Decentralized bargaining; J50; J53; L13; Oligo...,Article,Scopus
1,Cristóbal J.R.S.,"Cristóbal, José Ramón San (55262675100)",55262675100,The use of game theory to solve conflicts in t...,2015,International Journal of Information Systems a...,10.12821/ijispm030203,23,https://www.scopus.com/inward/record.uri?eid=2...,A typical construction project involves a wide...,Conflict; Game Theory; Negotiation; Project ma...,Article,Scopus
2,Sætersdal H.; Johannessen J.-A.,"Sætersdal, Helene (57188838425); Johannessen, ...",57188838425; 7101812429,The future of HR: Understanding knowledge mana...,2019,The Future of HR: Understanding Knowledge Mana...,,0,https://www.scopus.com/inward/record.uri?eid=2...,HR departments are in transition. From 1980 to...,,Book,Scopus
3,Chong V.K.; Loy C.Y.; Wang I.Z.; Woodliff D.R.,"Chong, Vincent K. (7004645694); Loy, Chanel Y....",7004645694; 56736045500; 57199509029; 6506920897,"The effect of negotiators’ role, leadership to...",2021,Journal of Management Control,10.1007/s00187-021-00321-8,0,https://www.scopus.com/inward/record.uri?eid=2...,This study examines the effect of negotiators’...,And negotiated transfer prices; Leadership ton...,Article,Scopus
4,Yeboah-Assiamah E.; Gyekye-Jandoh M.A.A.; Asam...,"Yeboah-Assiamah, Emmanuel (56527124800); Gyeky...",56527124800; 56690387200; 56950997000; 2346847...,"Henceforth, We Will Never Walk Alone: Empirica...",2022,Systemic Practice and Action Research,10.1007/s11213-021-09572-x,1,https://www.scopus.com/inward/record.uri?eid=2...,Moving beyond the ‘great man’ view of the orga...,Cases; Flexibility; Leadership; Networks; Tran...,Article,Scopus


## 2. Data pre-processing

In [7]:
corpus = text.copy()

def remove_stopwords(tokens):
    text = [x for x in tokens if x not in nltk.corpus.stopwords.words("english")]
    return text

def remove_punctuation(tokens):
    import string
    text = " ".join([x for x in tokens if x not in string.punctuation])
    text = " ".join([x for x in tokens if x not in string.digits])
    return text

def remove_words(tokens):
    words = ["δ18o", "paper"]
    text = [x for x in tokens if x not in words]
    text = [x for x in text if len(x) >= 3]
    #text = [x for x in text if x != ","]
    return text

    
def get_pos_tag(pos_tag):
    if pos_tag in ["JJ", "JJR", "JJS"]:
        return wordnet.ADJ
    elif pos_tag in ["RB", "RBR", "RBS"]:
        return wordnet.ADV
    elif pos_tag in ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]:
        return wordnet.VERB
    else: 
        return wordnet.NOUN

def lemmatize_words(tokens):
    pos_tags = nltk.pos_tag(tokens)
    words = " ".join([WordNetLemmatizer().lemmatize(i) for i in tokens])
    return words

def extract_noun (tokens):
    pos_tags = nltk.pos_tag(tokens)
    tags = ["NN", "NNS", "NNP", "NNPS"]
    words = [i[0] for i in pos_tags if i[1] in tags]
    return words
                      

In [None]:
corpus["Abstract"] = corpus["Abstract"].map(lambda x: x.lower())
corpus["Abstract"] = corpus["Abstract"].map(lambda x: RegexpTokenizer(r"\w+").tokenize(x))
corpus["Abstract"] = corpus["Abstract"].apply(lambda x: remove_words(x))
corpus["Abstract"] = corpus["Abstract"].map(lambda x: remove_stopwords(x))
corpus["Abstract"] = corpus["Abstract"].apply(lambda x: extract_noun(x))
corpus["Abstract"] = corpus["Abstract"].apply(lambda x: lemmatize_words(x))
print("Data pre-processed.")

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "C:\Users\tanye\miniconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3437, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-8-f1fce59dd334>", line 4, in <module>
    corpus["Abstract"] = corpus["Abstract"].map(lambda x: remove_stopwords(x))
  File "C:\Users\tanye\miniconda3\lib\site-packages\pandas\core\series.py", line 3909, in map
    new_values = super()._map_values(arg, na_action=na_action)
  File "C:\Users\tanye\miniconda3\lib\site-packages\pandas\core\base.py", line 937, in _map_values
    new_values = map_f(values, mapper)
  File "pandas\_libs\lib.pyx", line 2467, in pandas._libs.lib.map_infer
  File "<ipython-input-8-f1fce59dd334>", line 4, in <lambda>
    corpus["Abstract"] = corpus["Abstract"].map(lambda x: remove_stopwords(x))
  File "<ipython-input-7-498db1f91639>", line 4, in remove_stopwords
    text = [x for x in tokens if x not in nltk.corpus.stopwords.words("english")]
  F

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "C:\Users\tanye\miniconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3437, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-8-f1fce59dd334>", line 4, in <module>
    corpus["Abstract"] = corpus["Abstract"].map(lambda x: remove_stopwords(x))
  File "C:\Users\tanye\miniconda3\lib\site-packages\pandas\core\series.py", line 3909, in map
    new_values = super()._map_values(arg, na_action=na_action)
  File "C:\Users\tanye\miniconda3\lib\site-packages\pandas\core\base.py", line 937, in _map_values
    new_values = map_f(values, mapper)
  File "pandas\_libs\lib.pyx", line 2467, in pandas._libs.lib.map_infer
  File "<ipython-input-8-f1fce59dd334>", line 4, in <lambda>
    corpus["Abstract"] = corpus["Abstract"].map(lambda x: remove_stopwords(x))
  File "<ipython-input-7-498db1f91639>", line 4, in remove_stopwords
    text = [x for x in tokens if x not in nltk.corpus.stopwords.words("english")]
  F

In [8]:
print("Original data: " + text["Abstract"][0])
print("")
print("Pre-processed data: " + corpus["Abstract"][0])

Original data: The scope of firm-union bargaining is shown to be endogenously determined in a union-oligopoly model with decentralized negotiations. If the unions' power is sufficiently high, all bargaining units choose to negotiate over wages alone, i.e., universal right-to-manage bargaining emerges in equilibrium. Otherwise, wage/employment bargaining and right-to-manage bargaining coexist in the same industry. In equilibrium, some firm-union pairs will always choose to bargain over employment as well, since the firms become Stackelberg leaders in the market by committing to a particular output during the negotiations. The firms and their unions both benefit from the additional Stackelberg rents, provided that the unions' power is small enough. Our analysis suggests that there is not necessarily a negative relationship between unions' power and sectoral employment rates.

ERROR! Session/line number was not unique in database. History logging moved to new session 145


TypeError: can only concatenate str (not "list") to str

## 3. Exploratory Data Analysis

In [None]:
colors1 = ["#023373"]
colors2 = ["#023373", "#979797"]

In [None]:
def line(df, x, y, z):
    """
    Plot line chart.
    
    df = dataframe 
    x = categorical variable
    y = numerical variable 
    z = color_code
    
    """
    fig = px.line(
        data_frame = df,
        x = x, 
        y = y,
        color = z,
        color_discrete_sequence= colors2,
        template="simple_white",
        width=1000,        
        height=400,
        log_x = False
    )
        
    return fig.show()

In [None]:
def bar_chart(df, x, y):
    """
    Plot bar chart.
    
    df = dataframe
    x = categorical variable
    y = numerical variable 
    
    """
    fig = px.bar(
        data_frame = df,
        x = x,
        y = y,
        color_discrete_sequence=["#03658C"],
        template="simple_white",
        orientation = "v",
        text = y,
        facet_col = "Period",
        facet_col_wrap=2,
        facet_row_spacing = 0.15,        
        width=1000,        
        height=600
    )
    
    fig.update_layout(font=dict(family="Times New Roman",size=12),
                      xaxis={"categoryorder":"category ascending"}
                     )
    
    fig.for_each_annotation(lambda x: x.update(text = x.text.replace("Period=", "")))
    fig.update_yaxes(title_text = "", visible = False)
    fig.update_xaxes(title_text = "Topic")
    
    
    return fig.show()

In [None]:
def barh_chart(df, x, y, title):
    """
    Plot bar chart.
    
    df = dataframe
    x = categorical variable
    y = numerical variable 
    
    """
    fig = px.bar(
        data_frame = df,
        x = x,
        y = y,
        color_discrete_sequence=["#03658C"],
        template="simple_white",
        orientation = "h",
        text = x,
        width=1000,        
        height=800
    )
    
    fig.update_layout(font=dict(family="Times New Roman",size=12),
                      yaxis={'categoryorder':'max ascending'},
                      title={
                          'text': title,
                          'y':0.98,
                          'x':0.5,
                          'xanchor': 'center',
                          'yanchor': 'top'}
                     )
    
    fig.update_yaxes(title_text = "")
    fig.update_xaxes(title_text = "", visible = False)
    
    
    return fig.show()

In [None]:
def histogram(df, x, n_bins, title): 
    """
    Plot histogram.
    
    df = dataframe
    x = categorical variable
    n_bins = bin size 
    
    """
    fig = px.histogram(
        data_frame=df,
        x = x,
        marginal= "box",
        color_discrete_sequence=["#03658C"],
        nbins=n_bins,
        template="simple_white",
        width=1000,        
        height=500
    )
    
    fig.update_layout(
        font_family="Times New Roman", 
        title={
            'text': title,
            'y':0.95,
            'x':0.5,
            'xanchor': 'center',
            'yanchor': 'top'}
    )

    
    return fig.show()

In [None]:
def scatter_plot(df):
    """
    Plot scatter plot.
    
    df = dataframe with x and y coordinates 
    
    """
    fig = px.scatter(
        data_frame = df,
        #color = topics,
        opacity = 0.5,
        title = None,
        color_discrete_sequence=["#023373", "#03588C", "#03658C", "#6CBAD9", "#F2F2F2" ],
        template="simple_white",
        orientation = "h",
        width=1000,        
        height=600
    )
    
    fig.update_layout(
        font_family="Times New Roman" 
    )
    
    return fig.show() 

In [None]:
def area(df, x, y, z):  
    """
    Plot area chart.
    
    df = dataframe 
    x = categorical variable
    y = numerical variable 
    z = topics
    
    """
    fig = px.area(
        data_frame = df,
        x = x,
        y = y,
        template = "simple_white",
        color_discrete_sequence = colors1,
        facet_col = z,
        facet_col_wrap=2,
        facet_row_spacing = 0.1,
        height = 1000,
        width = 1000
    )
    
    fig.update_layout(
        font_family="Times New Roman",
        showlegend=False
    )
    
    fig.for_each_annotation(lambda x: x.update(text = x.text.replace("variable=", "")))
    
    fig.update_xaxes(title_text = "")
    fig.update_yaxes(title_text = "")
    
    return fig.show()

In [None]:
def plot_top_words(model, feature_names, n_top_words):
    
    for topic_idx, topic in enumerate(model.components_):
        top_features_ind = topic.argsort()[: -n_top_words - 1 : -1]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = topic[top_features_ind]

        fig = px.bar(
            x = weights,
            y = top_features,
            text =np.round(weights, 2),
            orientation = "h",
            color_discrete_sequence=["#03658C"],
            template = "simple_white", 
            #title="Long-Form Input",
            title = "Topic " + str(topic_idx),
            height = 400,
            width = 400

        )



        fig.update_layout(
            font_family="Times New Roman",
            showlegend=False,
            yaxis={'categoryorder':'max ascending'},
            title={
                'y':0.9,
                'x':0.5,
                'xanchor': 'center',
                'yanchor': 'top'}
        )
        
        fig.for_each_annotation(lambda x: x.update(text = x.text.replace("topic=", " ")))
        fig.update_yaxes(title_text = "")
        fig.update_xaxes(title_text = "", visible = False)
        
        fig.show()


In [None]:
def fig_2d(df, x, y, document, color, topic_order):
    fig = px.scatter(
        df, 
        x=x, 
        y=y,
        color = color, labels={"topic": "Themen"},
        hover_name = document,
        opacity = 0.85,
        template="simple_white",
        width=800,        
        height=500,
        color_discrete_sequence=["red", "green", "blue", "orange", "goldenrod", "magenta"],
        category_orders= topic_order
    )
    
    fig.update_layout(showlegend=True)


    fig.update_xaxes(visible = False)
    fig.update_yaxes(visible = False)

    return fig.show()       

In [None]:
def fig_3d(df, x, y, z, color):
    fig = px.scatter_3d(
        df, 
        x=x, 
        y=y,
        z=z,
        color = color,
        #hover_name = "Topic",
        opacity = 0.85,
        template="simple_white",
        width=1000,        
        height=800,
        color_continuous_scale = colors2
    )
    
    fig.update_layout(showlegend=False)


    fig.update_xaxes(visible = False)
    fig.update_yaxes(visible = False)

    return fig.show()    

In [None]:
def tm_dist_period(df):
    df = df.copy()
    df = df.groupby(["Year", "Topic"]).size()
    df = pd.DataFrame(df).reset_index().rename(columns={0:"Count"})

    df["Year"] = df["Year"].astype(int).astype(str)
    df["Topic"] = df["Topic"].astype(str)

    conditions = [
        (df["Year"] <= "1997"),
        (df["Year"] >= "1998") & (df["Year"] <= "2005"),
        (df["Year"] >= "2006") & (df["Year"] <= "2013"),
        (df["Year"] >= "2014")
    ]

    titles = ["a) zwischen 1977 und 1997",
              "b) zwischen 1998 und 2005", 
              "c) zwischen 2006 und 2013",
              "d) zwischen 2014 und 2020"]

    df["Period"] = np.select(conditions, titles)
    df = df.groupby(by = ["Period", "Topic"]).agg(Count = ("Count", np.sum)).reset_index()
    
    fig = bar_chart(df, df["Topic"].astype("int64"), df["Count"])
    
    return fig

In [None]:
def topic_distribution(df):
    topic_count = pd.DataFrame()

    topic_count["Topic"] = df.Topic
    topic_count = topic_count.groupby("Topic").agg(
        Total_Documents = ("Topic", np.size),
        Proportion = ("Topic", np.size))

    topic_count["Proportion"] = topic_count["Proportion"].apply(lambda x: round((x * 100) / len(corpus), 2))

    return topic_count.reset_index()

In [None]:
histogram(text, "Cited by", 150, "Histogramm der Anzahl der Zitationen")

In [None]:
#top_journals = (text.groupby(by = ["Source title"]).size().reset_index(name="Totel Documents"))
#top_journals = top_journals.sort_values("Totel Documents", ascending=False)

#top_journals = top_journals[:15]
#barh_chart(top_journals, y = "Source title", x = "Totel Documents", title = "Top 15 Journals")

In [None]:
histogram(text, "Year", (len(text["Year"].unique())), "Histogramm der Veröffentlichungsjahre")

In [None]:
corpus["word_count"] = corpus["Abstract"].apply(lambda x: len(str(x).split()))
histogram(corpus, "word_count", 30, "Histogramm der Anzahl der Wörter in einem Abstract")

In [None]:
top_words = pd.Series(' '.join(corpus["Abstract"]).split()).value_counts()[:30]
top_words = pd.DataFrame(top_words, columns = ["count"])

barh_chart(top_words, "count", top_words.index, "Top 30 Wörter im Korpus")

In [70]:
def get_top_n_gram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 3)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    
    sum_words = bag_of_words.sum(axis=0).round(2)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

top_words = get_top_n_gram(corpus["Abstract"], 30)
df3 = pd.DataFrame(top_words, columns = ["n_gram", "count"])

barh_chart(df3, "count", "n_gram", "Top 30 Wörter im Korpus")

In [71]:
def topic_popularity(model, df):
    topic_distribution = df.iloc[:, 5:]
    topic_distribution["Year"] = df.Year
    
    ytime = np.unique(df['Year'])
    
    topic_distributions_by_year = np.zeros([len(ytime), model.n_components])

    topic_distribution_by_year = topic_distribution.groupby(by = "Year").sum()
    topic_distribution_by_year = topic_distribution_by_year / np.sum(topic_distribution_by_year)
    topic_distribution_by_year = topic_distribution_by_year.reset_index().melt(id_vars = "Year")
    
    return topic_distribution_by_year

## 4. Model parameter

In [72]:
n_features = None
n_top_words = 10

k_min = 3
k_max = 30 + 1
k_step = 1

max_df = 0.95
min_df = 2


## 5. Calculate Topic Coherence Cv

In [73]:
def gensim_data(corpus):
    from gensim.models.coherencemodel import CoherenceModel
    from gensim.corpora.dictionary import Dictionary
   
    gensim_data = [i.split( ) for i in corpus]
    gensim_dict = Dictionary(gensim_data)

    gensim_dict.filter_extremes(no_below=3, no_above=0.95, keep_n=5000)

    gensim_corpus = [gensim_dict.doc2bow(text) for text in gensim_data]
   
    return gensim_data, gensim_dict, gensim_corpus
gensim_data, gensim_dict, gensim_corpus = gensim_data(corpus["Abstract"])


In [74]:
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.nmf import Nmf
from operator import itemgetter

coherence = []
n_topic = list(np.arange(k_min, k_max, k_step))
for i in n_topic:
    gensim_nmfModel = Nmf(
        corpus=gensim_corpus,
        num_topics=i,
        id2word=gensim_dict,
        chunksize=2000,
        passes=5,
        kappa=.1,
        minimum_probability=0.01,
        w_max_iter=300,
        w_stop_condition=0.0001,
        h_max_iter=100,
        h_stop_condition=0.001,
        eval_every=10,
        normalize=True,
        random_state=1
    )
  
    cm = CoherenceModel(
        model=gensim_nmfModel,
        texts=gensim_data,
        dictionary=gensim_dict,
        coherence= "c_v"
    )
  
    coherence.append((cm.get_coherence(), i, "NMF"))

In [75]:
nmf_coherence_values = pd.DataFrame(coherence, columns = ["Kohärenzwert Cv", "Anzahl der Themen", "Algorithmus"])
plot_coherence = line(nmf_coherence_values, "Anzahl der Themen", "Kohärenzwert Cv", None)

In [76]:
# print(stats.shapiro(nmf_coherence_values["Kohärenzwert Cv"]))


## 7. Non-negative Matrix Factorization Topic Modell 

In [77]:
tfidf_vectorizer = TfidfVectorizer(max_df = max_df, 
                                   min_df = min_df,
                                   max_features = n_features, 
                                   ngram_range = (1,2)
                                  )


In [78]:
tfidf_dtm = tfidf_vectorizer.fit_transform(corpus["Abstract"])

In [79]:
nmf_tm = NMF(n_components = 6, 
             random_state=1,
             alpha=.1,
             solver = "mu",
             max_iter = 1000,
             beta_loss="frobenius",
             l1_ratio=.5
            )


In [80]:
nmf_tm = nmf_tm.fit(tfidf_dtm)


The 'init' value, when 'init=None' and n_components is less than n_samples and n_features, will be changed from 'nndsvd' to 'nndsvda' in 1.1 (renaming of 0.26).



In [81]:
V = tfidf_vectorizer.transform(corpus["Abstract"])
W = nmf_tm.components_
H = nmf_tm.fit_transform(V) 


print("Non-negative Matrix Factorization Modell")
print('V = {} x {}'.format(V.shape[0], V.shape[1]))
print('W = {} x {}'.format(W.shape[0], W.shape[1]))
print('H = {} x {}'.format(H.shape[0], H.shape[1]))


Non-negative Matrix Factorization Modell
V = 467 x 4771
W = 6 x 4771
H = 467 x 6


In [82]:
nmf_df = pd.DataFrame(
    {"Topic" : np.argmax(H, axis = 1), 
     "Terms" : corpus["Abstract"], 
     #"Keywords" : corpus["Author Keywords"], 
     "Title" : corpus["Titles"], 
     "DOI" : corpus["DOI"], 
     "Year" : corpus["Year"]}) 
     #"Journal" : corpus["Source title"]})


In [83]:
nmf_df = nmf_df.reset_index()

topic_names = ["Topic " + str(i) for i in range(nmf_tm.n_components)]

df_document_topic = pd.DataFrame(np.round(H, 2), columns = topic_names)

nmf_df = nmf_df.join(df_document_topic)
nmf_df


Unnamed: 0,index,Topic,Terms,Title,DOI,Year,Topic 0,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5
0,0,3,scope firm union union oligopoly model negotia...,Endogenous scope of bargaining in a union-olig...,10.1016/S0927-5371(99)00043-3,2000,0.00,0.00,0.00,0.35,0.00,0.0
1,1,0,construction project range disparate professio...,The use of game theory to solve conflicts in t...,10.12821/ijispm030203,2015,0.10,0.00,0.00,0.00,0.00,0.0
2,2,0,department transition today management functio...,The future of HR: Understanding knowledge mana...,,2019,0.12,0.00,0.00,0.00,0.00,0.0
3,3,4,study examines effect negotiator role seller b...,"The effect of negotiators’ role, leadership to...",10.1007/s00187-021-00321-8,2021,0.04,0.00,0.00,0.00,0.23,0.0
4,4,0,man view organization leader repository knowle...,"Henceforth, We Will Never Walk Alone: Empirica...",10.1007/s11213-021-09572-x,2022,0.14,0.00,0.00,0.00,0.00,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
462,471,0,book review article louis cherns editor qualit...,Book Review Section,10.1111/j.1744-6570.1976.tb00413.x,1976,0.09,0.00,0.01,0.00,0.00,0.0
463,472,0,finance minister summit level crisis summit le...,G-20: In search of a role,10.1142/9789814578622_0002,2015,0.06,0.03,0.00,0.00,0.00,0.0
464,473,0,term neighborhood refers patient home provider...,Coordination within medical neighborhoods: Ins...,10.1097/HMR.0000000000000063,2016,0.04,0.00,0.00,0.00,0.00,0.0
465,474,1,aim trade reform trade policy country india pa...,Trade policy in South Asia: Recent liberalisat...,10.1111/1467-9701.00206,1999,0.01,0.17,0.00,0.00,0.00,0.0


In [84]:
topic_distribution(nmf_df)

Unnamed: 0,Topic,Total_Documents,Proportion
0,0,333,71.31
1,1,43,9.21
2,2,21,4.5
3,3,29,6.21
4,4,24,5.14
5,5,17,3.64


In [85]:
feature_names = tfidf_vectorizer.get_feature_names()

plot_top_words(nmf_tm, feature_names, 10)


In [86]:
area(topic_popularity(nmf_tm, nmf_df), "Year", "value", "variable")


In [87]:
tm_dist_period(nmf_df)


In [89]:
nmf_df.to_excel("C:/Users/tanye/Desktop/Literature Review/Literature/research.xlsx")