# DOCUMENTATION

1. Import the relevant libraries and download the relevant resources.
2. Specify the details for retrieving and storing the data.
3. Get the dataset from the MongoDB database and store it as a pandas DataFrame.
4. Reduce the original DataFrame by removing the columns that are not needed for topic modelling. Currently only the area (`bereich`) and the description of the project are kept.
5. Identify stopwords:
    5.1. Load NLTK's English stopwords and the German stopword list (`german_stopwords_full.txt`)
    5.2. Add cities and months to them
    5.3. Add manually collected stopwords (words irrelevant to our analysis)
6. Create the stemmer: our own stemmer is a dictionary that specifies how variants of the same word are combined.
7. Tokenize and stem the description data in the DataFrame.
8. Get the frequency distribution of all words in the DataFrame.
9. Choose a threshold frequency for top words (currently 100).
10. Reduce the words in the tokenized column to these top words for each row.
11. Remove the rows that have fewer than 10 tokens.
12. Create the training and test datasets.
13. Train the LDA model.
14. Save the model as 'LDA_Approach_1.model'.

### Importing all the relevant libraries 

In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
import gensim
from gensim.models import LdaModel
from gensim import models, corpora, similarities
import re
import time
from nltk import FreqDist
from scipy.stats import entropy
import matplotlib.pyplot as plt
import seaborn as sns
from pymongo import MongoClient
from nltk.tokenize import word_tokenize
import pickle
import string
from string import punctuation
import os
%matplotlib inline
sns.set_style("darkgrid")

### Set the directory containing this notebook as the working directory

In [2]:
currDir = os.getcwd()
if "USL" not in currDir:
    os.chdir(os.path.join(currDir,  "ML", "USL"))

### Downloading relevant resources

In [3]:
#Using NLTK Downloader to obtain the resource stopwords, punkt
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Database details for retrieving dataset and storing the dataset

In [4]:
#Details for retrieving  data from projectfinder
db_loc = {
    'ip' :'10.10.250.0',
    'port' : 27017,
    'database' : 'projectfinder',
    'collection' : 'itproject_clean'
}

In [5]:
#Details for storing data related to projectfinder
db_data = {
    'ip' :'10.10.250.0',
    'port' : 27017,
    'database' : 'projectfinder',
    'collection' : 'mldata1'
}

### Methods for loading the dataset

### Creating a path for the directory containing constants and some self-made packages

In [6]:
dataDir = os.path.join(currDir,  "constants")
import constants.load_save_dataset as load_save_DS

In [7]:
df_rawData = load_save_DS.load_dataset_from_mongodb(db_loc)
df_rawData.shape

Data loaded from mongodb itproject_clean collection succesfully


(14059, 25)
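
The helper `load_save_DS.load_dataset_from_mongodb` lives in the project's own `constants` package and is not shown in this notebook. The following is only a minimal sketch of what such a loader might look like, assuming it connects with `pymongo`, reads every document of the collection, and hides MongoDB's `_id` field. The function name, the optional `client` parameter, and the projection are assumptions made for illustration, not the project's actual code.

```python
import pandas as pd


def load_dataset_from_mongodb_sketch(db_details, client=None):
    """Hypothetical loader: read one MongoDB collection into a DataFrame.

    db_details uses the same keys as db_loc above: ip, port, database, collection.
    client is an optional, already connected client (handy for testing).
    """
    if client is None:
        # Imported lazily so the sketch can be exercised without a database.
        from pymongo import MongoClient
        client = MongoClient(db_details['ip'], db_details['port'])

    collection = client[db_details['database']][db_details['collection']]
    # Empty filter = all documents; the projection hides the ObjectId column.
    records = list(collection.find({}, {'_id': 0}))
    return pd.DataFrame(records)
```

Calling `load_dataset_from_mongodb_sketch(db_loc)` against a reachable server would return one row per stored project.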

In [8]:
def get_required_dataset(original_dataset):
    """Keep only the project description and its area, renamed to 'project' and 'label'."""
    # Select the required columns; .copy() avoids pandas' SettingWithCopyWarning
    df = original_dataset[['description', 'bereich']].copy()
    # Drop projects without a description
    df = df[df['description'] != '']
    df = df.rename(columns={'description': 'project', 'bereich': 'label'})
    # Exclude the 'IT/Bauingenieur' (civil engineering) label
    df = df[df['label'] != 'IT/Bauingenieur']
    df = df.drop_duplicates()
    return df

In [9]:
df_preprocessedDataset = get_required_dataset(df_rawData)
df_preprocessedDataset.shape
df_preprocessedDataset.head()

Unnamed: 0,project,label
0,Für einen unserer Kunden aus dem Finanzdienstl...,Infr-Admin-Microsoft
1,Kann Profil leider nicht löschen.,IT/Consulting
2,Business Intelligence Analyst (m/w) - Tableau ...,Data-Sci-BI
3,"Konzeption, Customizing sowie Softwareanpassun...",Infr-Admin-Linux
4,Es sollen mehrere Automatisierungen mit ubot S...,IT/IT


In [10]:
# shuffle the data
df_preprocessedDataset = df_preprocessedDataset.sample(frac=1.0)
df_preprocessedDataset.reset_index(drop=True,inplace=True)
df_preprocessedDataset.head()

Unnamed: 0,project,label
0,Aufgabe: \n* Erstellen von Layouts zur Entwick...,IT-Technical-Dev
1,Folgende Qualifikationen / Aufgabenstellungen ...,Infr-Admin-Database
2,Für unseren Kunden suchen wir aktuell eine/n \...,Infr-Database-Admin
3,Beginn/Dauer: 6 PM mit Option \r\nEinsatzort:...,Dev-Web-Fullstack
4,- Sie begleiten das CRM-System auf der IT-Entw...,IT/ERP / CRM Sy


In [11]:
df_preprocessedDataset.iloc[0,0]

'Aufgabe: \n* Erstellen von Layouts zur Entwicklung neuer Mainboards, Platinen sowie Leiterplatten für Sondermaschinenbau \n \nVoraussetzungen: \n* Erfahrung in der Layouterstellung mit EAGLE \n \nEintrittsdatum: 01.09.2018 \nDauer: 24 Wochen'

### Stopwords generation

In [12]:
# Load the German stopword list from file and NLTK's English stopwords
dataDir = os.path.join(currDir,  "constants")
with open(os.path.join(dataDir, 'german_stopwords_full.txt'), 'r') as f:
    stopwords_germ = f.read().splitlines()
stopwords_eng = nltk.corpus.stopwords.words('english')
combined_stopWordsPath = os.path.join(dataDir, 'stopwords_manual.txt')

In [13]:
#german cities
from constants.bundeslander import Baden_Württemberg, Bayern, Berlin, Brandenburg, Bremen, Hamburg, Hessen, Mecklenburg_Vorpommern, Niedersachsen, Nordrhein_Westfalen, Rheinland_Pfalz, Saarland, Sachsen, Sachsen_Anhalt, Schleswig_Holstein, Thüringen, Ausland

All = Baden_Württemberg + Bayern + Berlin + Brandenburg + Bremen +Hamburg + Hessen + Mecklenburg_Vorpommern + Niedersachsen + Nordrhein_Westfalen + Rheinland_Pfalz + Saarland + Sachsen + Sachsen_Anhalt + Schleswig_Holstein + Thüringen + Ausland
cities = list(set([city.lower() for city in All]))

In [14]:
months = ['Januar', 'January','Februar', 'February', 'März', 'March', 'April', 'Mai', 'May', 'Juni', 'June', 'Juli', 
          'July', 'August', 'September', 'Oktober', 'October', 'November', 'Dezember', 'December']
months = [month.lower() for month in months]
print(months)

['januar', 'january', 'februar', 'february', 'märz', 'march', 'april', 'mai', 'may', 'juni', 'june', 'juli', 'july', 'august', 'september', 'oktober', 'october', 'november', 'dezember', 'december']


In [15]:
# Read the manually collected stopwords; the with-block closes the file again
with open(combined_stopWordsPath) as f:
    stopwords_manual = [line.rstrip('\n') for line in f]
print(len(stopwords_manual))

844


In [16]:
stopwords_all = list(set(stopwords_germ + stopwords_eng + stopwords_manual + cities + months))
len(stopwords_all)

13240

In [17]:
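# Rewrite the manual stopword file: sorted, de-duplicated, and without words
# that are already covered by the other stopword sources
stopwords_add = []  # new manual stopwords can be listed here
stopwords_add = sorted(set(stopwords_add + stopwords_manual))
checker = set(stopwords_germ + stopwords_eng + cities + months)
with open(combined_stopWordsPath, 'w') as f:
    for item in stopwords_add:
        if item not in checker:
            f.write("%s\n" % item)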

In [18]:
# combined_stopWordsPath already includes dataDir, so it is used directly
with open(combined_stopWordsPath) as f:
    stopwords_manual = [line.rstrip('\n') for line in f]
print(len(stopwords_manual))

stopwords_all = list(set(stopwords_germ + stopwords_eng + stopwords_manual + cities + months))
len(stopwords_all)

844


13240

In [19]:
# Load our own stemmer (a dictionary) from disk
stemmerLocation = os.path.join(dataDir, "stemmer_own.pickle")
with open(stemmerLocation, "rb") as pickle_in:
    stemmer_own = pickle.load(pickle_in)
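
The content of `stemmer_own.pickle` is not shown in this notebook. Judging from how the stemmer is used in `text_processing`, it appears to map a substring (key) to the canonical word that replaces every token containing it. A minimal sketch under that assumption follows; the dictionary entries and the helper `stem_token` are hypothetical and only illustrate the lookup.

```python
# Hypothetical entries; the real stemmer_own.pickle is not shown in this notebook.
example_stemmer = {
    'entwickl': 'entwicklung',      # entwickler, softwareentwicklung, ...
    'develop': 'entwicklung',       # map the English variant onto the same stem
    'administr': 'administration',  # administrator, administrieren, ...
}


def stem_token(token, stemmer):
    """Return the canonical form of the first key contained in token, else the token."""
    for key, canonical in stemmer.items():
        if key in token:
            return canonical
    return token


tokens = ['softwareentwickler', 'developer', 'systemadministrator', 'layouts']
stemmed = [stem_token(t, example_stemmer) for t in tokens]
print(stemmed)
```

Because the match is a substring test and the first matching key wins, short keys can capture unrelated words, so the order and length of the keys matter.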

### Perform tokenization, stemming on the text

In [20]:
# Set lookups are O(1); testing membership in the 13k-element list is slow
stopwords_lookup = set(stopwords_all)

def text_processing(text):
    """Normalize, tokenize and stem the original text string.

    Args:
    text: string. String containing the project description to process

    Returns:
    list of strings. Normalized and stemmed word tokens (bigrams are formed in a later step)
    """
    # Defined before the try block so the return below can never fail
    stemmed_tokens = []
    try:
        # Lowercase, then replace digits and punctuation with spaces
        text = re.sub(r'\d', ' ', text.lower())
        text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
        tokens = word_tokenize(text)
        tokens_cleaned = [word for word in tokens if word not in stopwords_lookup and len(word) > 1]

        for word in tokens_cleaned:
            # Replace the word by the stem of the first stemmer key it contains
            for stemmer_key, stemmed_word in stemmer_own.items():
                if stemmer_key in word:
                    stemmed_tokens.append(stemmed_word)
                    break
            else:
                # No key matched: keep the word unchanged
                stemmed_tokens.append(word)

    except IndexError:
        # Return whatever was processed so far instead of failing
        pass

    return stemmed_tokens
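
To make the normalization step concrete, here is a small dependency-free sketch applied to a fragment of the sample project shown earlier. It assumes `str.split()` as a stand-in for `word_tokenize` and uses a tiny hypothetical stopword set, so its tokens may differ slightly from what NLTK produces.

```python
import re
import string


def normalize(text, stopwords):
    """Lowercase, blank out digits and ASCII punctuation, drop stopwords and 1-char tokens."""
    text = re.sub(r'\d', ' ', text.lower())
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    # str.split() stands in for nltk.word_tokenize to keep the sketch dependency-free.
    return [tok for tok in text.split() if tok not in stopwords and len(tok) > 1]


sample = 'Eintrittsdatum: 01.09.2018 \nDauer: 24 Wochen'
cleaned = normalize(sample, stopwords={'dauer', 'wochen'})  # hypothetical stopwords
print(cleaned)
```

Dates and durations disappear entirely because every digit is replaced by a space before tokenization.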

In [21]:
# Tokenize and stem the project descriptions into the new column "token_stem_spRm"
t1 = time.time()
df_preprocessedDataset['token_stem_spRm'] = df_preprocessedDataset['project'].apply(text_processing)
t2 = time.time()

In [22]:
print("Time taken to prepare", len(df_preprocessedDataset), "projects documents:", (t2-t1)/60, "min")

Time taken to prepare 12168 projects documents: 9.076289677619934 min


In [23]:
df_preprocessedDataset.head()

Unnamed: 0,project,label,token_stem_spRm
0,Aufgabe: \n* Erstellen von Layouts zur Entwick...,IT-Technical-Dev,"[layouts, entwicklung, mainboards, platinen, l..."
1,Folgende Qualifikationen / Aufgabenstellungen ...,Infr-Admin-Database,"[aufgabenstellungen, tiefes, entwicklung, db, ..."
2,Für unseren Kunden suchen wir aktuell eine/n \...,Infr-Database-Admin,"[architekt, lamp, informatik, arbeitszeit, pro..."
3,Beginn/Dauer: 6 PM mit Option \r\nEinsatzort:...,Dev-Web-Fullstack,"[pm, entwicklung, testvorgehens, concept, sctm..."
4,- Sie begleiten das CRM-System auf der IT-Entw...,IT/ERP / CRM Sy,"[begleiten, crm, system, entwicklung, funktion..."


In [24]:
bigram = gensim.models.Phrases(df_preprocessedDataset['token_stem_spRm'].tolist(), min_count=2, threshold=2) # higher threshold fewer phrases.
bigram_mod = gensim.models.phrases.Phraser(bigram)
def make_bigrams(text):
    return bigram_mod[text]

In [25]:
# Form Bigrams
df_preprocessedDataset['token_stem_spRm_bigram'] = df_preprocessedDataset['token_stem_spRm'].apply(make_bigrams)
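
The `threshold` and `min_count` arguments decide which adjacent word pairs are merged into a single token such as `business_intelligence`. As far as the gensim documentation describes its default scorer, a pair is scored by how much more often the two words occur together than their individual counts would suggest. The sketch below re-implements that formula by hand for illustration; the function name and all counts are hypothetical.

```python
def bigram_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a word pair: (count_ab - min_count) / (count_a * count_b) * vocab_size."""
    if count_a == 0 or count_b == 0:
        return float('-inf')
    return (count_ab - min_count) / (count_a * count_b) * vocab_size


# Hypothetical counts: 'business' 40x, 'intelligence' 30x, adjacent 25x, 5000 distinct tokens.
score = bigram_score(40, 30, 25, vocab_size=5000, min_count=2)
print(score)

# The pair is merged only if its score exceeds the threshold (2 in this notebook).
print(score > 2)
```

With a threshold as low as 2, many pairs qualify; raising it keeps only strongly associated pairs.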

### Identify the most frequently occurring words

In [None]:
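# Assumption: the bigram tokens are the ones to be modelled, so they are stored in the
# 'tokenized' column that the following cells work on (no earlier cell creates it)
df_preprocessedDataset['tokenized'] = df_preprocessedDataset['token_stem_spRm_bigram']

# Create a list containing all the words in the dataframe
all_words_df = [word for item in df_preprocessedDataset['tokenized'] for word in item]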

# Use nltk fdist to get a frequency distribution of all words
fdist_words = FreqDist(all_words_df)
print(len(fdist_words)) # number of unique words
print(type(fdist_words))

#print(fdist_words.items())

In [None]:
total_unique_words = len(fdist_words)
sorted_freqDist_words = fdist_words.most_common()
maxFreq = sorted_freqDist_words[0][1]
print(maxFreq)
freq_values = [freq for _, freq in sorted_freqDist_words]
avgFreq = np.mean(freq_values)
print(avgFreq)

In [None]:
# Keep the words with a frequency of 100 or more
top_words = [word for word, freq in sorted_freqDist_words if freq >= 100]
print(len(top_words))
#print(top_words)

In [None]:
top_words_set = set(top_words)  # set membership is much faster than list membership

def most_appeared(text):
    return [word for word in text if word in top_words_set]

In [None]:
# Reduce the words in the tokenized column to the words with a frequency of at least 100
df_preprocessedDataset['tokenized'] = df_preprocessedDataset['tokenized'].apply(most_appeared)
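
The frequency filter can be sketched on a toy corpus with the standard library alone. `collections.Counter` is used here in place of `FreqDist` (which is itself a `Counter` subclass), and the documents and the threshold of 2 are made up to keep the example readable.

```python
from collections import Counter

docs = [
    ['sap', 'entwicklung', 'java'],
    ['sap', 'java', 'cobol'],
    ['sap', 'entwicklung', 'testing'],
]
min_freq = 2  # the notebook uses 100; 2 keeps the toy example readable

freq = Counter(word for doc in docs for word in doc)
# A set gives constant-time membership tests, unlike a list.
top_words = {word for word, count in freq.items() if count >= min_freq}

reduced = [[word for word in doc if word in top_words] for doc in docs]
print(reduced)
```

Rare words such as `cobol` and `testing` are dropped from every document, which shrinks the vocabulary the LDA model has to fit.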

In [None]:
df_preprocessedDataset.head(20)

In [None]:
# keep only projects with at least 10 tokens; shorter ones carry too little signal
df_preprocessedDataset = df_preprocessedDataset[df_preprocessedDataset['tokenized'].map(len) >= 10]
# make sure all tokenized items are lists
df_preprocessedDataset = df_preprocessedDataset[df_preprocessedDataset['tokenized'].map(type) == list]
df_preprocessedDataset.reset_index(drop=True,inplace=True)

In [None]:
print("After cleaning and excluding short projects, the dataframe now has:", len(df_preprocessedDataset), "projects")

### Split data to training and testing 

In [None]:
# create a boolean mask: roughly 99.6% of the rows go to training, the rest to testing
msk = np.random.rand(len(df_preprocessedDataset)) < 0.9960
msk

In [None]:
train_df = df_preprocessedDataset[msk]
train_df.reset_index(drop=True,inplace=True)

test_df = df_preprocessedDataset[~msk]
test_df.reset_index(drop=True,inplace=True)
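
Because the mask is drawn at random, the size of the test set varies from run to run. The sketch below shows the same masking idea with a seeded NumPy generator so that the split is reproducible; the seed and the corpus size are assumptions for illustration, since the notebook itself does not seed the random state.

```python
import numpy as np

rng = np.random.default_rng(42)      # fixed seed: an addition, the notebook does not seed
n_docs = 10_000                      # hypothetical corpus size
mask = rng.random(n_docs) < 0.9960   # True -> training row, False -> test row

n_train = int(mask.sum())
n_test = int((~mask).sum())
print(n_train, n_test)
```

On average only about 0.4% of the rows end up in the test set, so with a few thousand projects the test set holds a few dozen documents at most.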

In [None]:
train_df.head()

### Train the LDA model on training data

In [None]:
def train_lda(data, n=10):
    """
    Train the LDA model on the 'tokenized' column of data.
    We set parameters such as the number of topics and the chunksize used by
    Hoffman's online learning method.
    We make 2 passes over the data because this is a small dataset and we want
    the distributions to stabilize.
    Returns the dictionary, the bag-of-words corpus and the trained model.
    """
    num_topics = n
    chunksize = 300
    dictionary = corpora.Dictionary(data['tokenized'])
    corpus = [dictionary.doc2bow(doc) for doc in data['tokenized']]
    t1 = time.time()
    # low alpha means each document is only represented by a small number of topics, and vice versa
    # low eta means each topic is only represented by a small number of words, and vice versa
    lda = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary,
                   alpha=1e-2, eta=0.5e-2, chunksize=chunksize, minimum_probability=0.0, passes=2)
    t2 = time.time()
    # report the size of the data actually passed in, not of the full dataframe
    print("Time to train LDA model on", len(data), "documents:", (t2-t1)/60, "min")
    return dictionary,corpus,lda
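
The comments in `train_lda` state that a low `alpha` concentrates each document on a few topics. One way to see this, independently of gensim, is to sample topic mixtures from a symmetric Dirichlet prior and look at the share of the dominant topic. This illustrates the prior only, not the trained model; the seed, the sample count, and the comparison value of 1.0 are choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
num_topics = 10


def mean_largest_share(alpha, n_samples=1000):
    """Average weight of the dominant topic in mixtures drawn from Dirichlet(alpha)."""
    samples = rng.dirichlet([alpha] * num_topics, size=n_samples)
    return float(samples.max(axis=1).mean())


sparse = mean_largest_share(0.01)  # the alpha used in train_lda
dense = mean_largest_share(1.0)    # a flat prior for comparison
print(sparse, dense)
```

With `alpha=0.01` almost all of the probability mass sits on a single topic, whereas the flat prior spreads it across several topics. The same reasoning applies to `eta` and the words within a topic.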

In [None]:
dictionary,corpus,lda = train_lda(train_df, 10)

In [None]:
lda.save('LDA_Approach_1.model')

In [None]:
from gensim import corpora, models, similarities
model =  models.LdaModel.load('LDA_Approach_1.model')

In [None]:
# print all topics
model.show_topics(num_topics=20, num_words=20)

In [None]:
with open('dictionary_LDA_A1', 'wb') as output:
    pickle.dump(dictionary, output)
    
with open('corpus_LDA_A1', 'wb') as output:
    pickle.dump(corpus, output)

In [None]:
# Save model to disk.
from gensim.test.utils import datapath
temp_file = datapath("model")
lda.save(temp_file)

# the with-block makes sure the pickle file is flushed and closed
with open('model_LDA_A1', 'wb') as output:
    pickle.dump(lda, output)

In [None]:
# Load a potentially pretrained model from disk.
lda2 = LdaModel.load(temp_file)

In [None]:
# show_topics lists the top num_words words of up to num_topics topics;
# the model has only 10 topics, so asking for 13 returns all of them
lda.show_topics(num_topics=13, num_words=20)

In [None]:
for t_id in range(2):
    print("TopicID: " + str(t_id))
    # show_topic returns (word, probability) pairs for one topic
    for word, prob in lda.show_topic(topicid=t_id, topn=20):
        print(word + ": " + str(prob))
    print()


### Random project from training data

In [None]:
# Select a project at random from train_df
random_index = int(np.random.randint(len(train_df)))
print(random_index)

In [None]:
# select the column by name: position 2 depends on which columns were added before
data_to_check = train_df.loc[random_index, 'tokenized']
bow = dictionary.doc2bow(data_to_check)
doc_distribution = np.array([topic[1] for topic in lda.get_document_topics(bow=bow)])
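
`dictionary.doc2bow` turns a token list into the bag-of-words form that the model expects: pairs of token id and count, with tokens unknown to the dictionary silently dropped. The following standard-library sketch mimics that behaviour for illustration; the helper names are hypothetical, and ids are assigned in order of first appearance here, which may differ from gensim's numbering.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc2bow_sketch(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from the vocabulary are skipped."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


token2id = build_vocabulary([['sap', 'java'], ['java', 'linux']])
bow_example = doc2bow_sketch(['java', 'java', 'sap', 'cobol'], token2id)
print(bow_example)
```

This is why the document passed to the model should come from the same reduced token column the dictionary was built on: any other words simply do not contribute.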

In [None]:
# show the tokens that were passed to the model
print(data_to_check)

In [None]:
print(doc_distribution)
print(len(doc_distribution))
# indices of the three most probable topics
print(np.argsort(-doc_distribution)[:3])

In [None]:
# bar plot of topic distribution for this document
def plot_topic_dist(doc_distr, index):
    """
    This function plots the topic distrubtion for a given document
    It takes two parameters
    (1) doc_distr = type: list of floats, list of topic probability distribution in a document
    (2) index = type: int, index number of document to plot
    We also do 2 passes of the data since this is a small dataset, so we want the distributions to stabilize
    """
    fig, ax = plt.subplots(figsize=(12,8));
    # the histogram of the data
    patches = ax.bar(np.arange(len(doc_distr)), doc_distr)
    ax.set_xlabel('Topic ID', fontsize=15)
    ax.set_ylabel('Topic Probability Score', fontsize=15)
    ax.set_title("Topic Distribution for Project in Index " + str(index), fontsize=20)
    ax.set_xticks(range(0,10))
    x_ticks_labels = ['ERP/SAP','SW_Dev/Web','IT_App_Mgr/SW_Dev_Arch','SW_Dev/DevOps','Sys_Admin/Support', 'IT_Admin_SW/Oracle/Ops','Data/Ops','IT_Process_Mgr/Consultant', 'MS_DEV/Admin','Business_Analyst/Consulting']
    ax.set_xticklabels(x_ticks_labels, rotation='vertical', fontsize=8)
    fig.tight_layout()
    return plt.show()

In [None]:
plot_topic_dist(doc_distribution, random_index)

In [None]:
lda_model =  models.LdaModel.load('LDA_Approach_1.model')

In [None]:
lda_model.show_topics()

In [None]:
doc_distribution1 = np.array([topic[1] for topic in lda_model.get_document_topics(bow=bow)])
labels = np.argmax(doc_distribution1)
print(doc_distribution1)
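
The index returned by `np.argmax` only becomes readable once it is mapped to a topic name. The sketch below does this with the names used in `plot_topic_dist`, which were assigned by hand and are meaningful only for the topic ordering of that particular trained model. The probability vector is hypothetical.

```python
import numpy as np

# Topic names copied from plot_topic_dist; valid only for that model's topic order.
topic_names = ['ERP/SAP', 'SW_Dev/Web', 'IT_App_Mgr/SW_Dev_Arch', 'SW_Dev/DevOps',
               'Sys_Admin/Support', 'IT_Admin_SW/Oracle/Ops', 'Data/Ops',
               'IT_Process_Mgr/Consultant', 'MS_DEV/Admin', 'Business_Analyst/Consulting']

# Hypothetical topic distribution for one project (sums to 1).
example_distribution = np.array([0.02, 0.55, 0.01, 0.25, 0.01, 0.01, 0.10, 0.02, 0.02, 0.01])

best_topic = int(np.argmax(example_distribution))
top3 = np.argsort(-example_distribution)[:3]

print(topic_names[best_topic])
for topic_id in top3:
    print(topic_id, topic_names[topic_id], example_distribution[topic_id])
```

Retraining the model can reorder the topics, so the name list has to be reviewed after every training run.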