## Portfolio Assignment week 02

### Text clustering 

Read, execute and analyse the code in the notebook tutorial_clustering_words. Then *choose one* of the assignments a), b) or c). 

a) Read the article Clinical Documents Clustering Based on Medication/Symptom Names Using Multi-View Nonnegative Matrix Factorization. You can find the article <a href = 'https://pubmed.ncbi.nlm.nih.gov/26011887/'>here</a>. Explain the similarities between this notebook and the article. Explain in your own words what needs to be added to this notebook to reproduce the article. There is no need to code the solution; you can describe the steps in your own words. 

b) Improve the outcome by improving the data preprocessing and the hyperparameter configuration. Explain your choices. Your solution should be a coded solution with comments. Are there any other weighting solutions besides TF-IDF?

c) Provide a text clustering solution with your own data of interest, you can follow a similar approach to the one in the tutorial_clustering_words notebook. 

## Answer

### b


**Data preprocessing in text data**
1. **Tokenization**: 
    Break down the text into individual words or phrases, which will be used as features for the machine learning algorithm.
2. **Stop Word Removal**: 
    Remove common words that do not carry much meaning, such as "the", "and", and "a".
3. **Stemming/Lemmatization**: 
    Reduce words to their root form to avoid having multiple forms of the same word.
4. **Normalization**: 
    Convert all text to lowercase so that differently cased spellings of a word (for example "Fever" and "fever") are treated as the same token.
5. **Counting**: 
    Count the frequency of each word in the document and create a vector of word counts.
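
The five steps above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: `STOP_WORDS` is a tiny hypothetical list, and `crude_stem` is a made-up stand-in that only strips a plural "s" (the notebook itself uses NLTK's `WordNetLemmatizer`). Lowercasing is applied first here because it makes the later steps simpler.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only)
STOP_WORDS = {"the", "and", "a", "of", "with", "was", "in"}

def crude_stem(word):
    """Rough stand-in for a real stemmer/lemmatizer: strips a plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess(text):
    # Normalization: lowercase everything
    text = text.lower()
    # Tokenization: keep runs of ASCII letters only
    tokens = re.findall(r"[a-z]+", text)
    # Stop-word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Stemming (crude): map plural forms onto the singular
    tokens = [crude_stem(t) for t in tokens]
    # Counting: word -> frequency
    return Counter(tokens)

counts = preprocess("The patient presented with fever and the fevers persisted.")
print(counts)
```

In this example "fever" and "fevers" end up as the same token, and the stop words never reach the counting step.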

**TF-IDF (Term Frequency-Inverse Document Frequency)**

TF-IDF is a numerical representation commonly used in natural language processing to evaluate the importance of a word within a document relative to a collection of documents. It can be summarized as follows:

1. **Term Frequency (TF):** TF measures the frequency of a word within a document. It gives higher weight to words that appear more often in the document, as they are likely to be more relevant to its content.

2. **Inverse Document Frequency (IDF):** IDF assesses the uniqueness of a word by calculating how often it appears across the entire collection of documents. Words that appear in many documents receive lower IDF scores, while those that appear in fewer documents are assigned higher IDF scores.

3. **Combining TF and IDF:** TF-IDF is computed by multiplying the TF of a word in a document by its IDF in the collection. This multiplication emphasizes words that are frequent within a document (high TF) but not so common across all documents (high IDF). This results in a numerical representation that reflects the importance of a word to a specific document within the context of the entire corpus.

4. **Applications:** TF-IDF is commonly used for information retrieval, text mining, and document clustering. It helps in identifying keywords, classifying documents, and finding relevant documents in search engines.

In essence, TF-IDF quantifies how relevant a word is to a document within the context of a larger collection. Words that are frequent in the document but rare in the collection receive higher TF-IDF scores, indicating their significance to the content of the document.
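
As a worked example of the multiplication described above, the following sketch computes TF-IDF by hand on three toy documents. It assumes the plain textbook definitions (TF as a relative frequency, IDF as `log(N / df)`); scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ. The toy documents and helper names are made up for illustration.

```python
import math

# Toy corpus: each document is already tokenized
docs = [
    ["fever", "cough", "fever"],
    ["tumor", "cough"],
    ["tumor", "biopsy"],
]

def tf(term, doc):
    # Relative frequency of the term inside one document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Assumes the term occurs in at least one document
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

fever_score = tf_idf("fever", docs[0], docs)  # frequent here, rare elsewhere
cough_score = tf_idf("cough", docs[0], docs)  # less frequent here, more common elsewhere
print(fever_score, cough_score)
```

"fever" scores higher than "cough" in the first document because it occurs more often there and appears in fewer documents overall.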

**The Bag of Words (BoW) algorithm**

BoW is a feature-extraction technique used in Natural Language Processing (NLP) to represent text for machine learning algorithms. It converts each document into a vector of word counts over a fixed vocabulary. The model records only how often known words occur in a document, not where they occur, so word order is discarded. The approach is simple and flexible, and can be adapted in many ways for extracting features from documents.

A BoW representation can be visualized as a table in which each row is a document, each column is a vocabulary word, and each cell holds the count of that word in that document. In practice the vocabulary is usually limited to the most frequently used words, because vocabulary size is a major issue for the BoW model. A consequence is that a word outside the vocabulary, for example a rare but informative word such as "biblioklept" (one who steals books), is not recognized and is simply ignored.

In conclusion, the BoW algorithm is a simple and effective way to represent text data for machine learning algorithms. It is widely used in NLP for tasks such as language modeling and document classification.
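
A minimal sketch of this idea in plain Python, assuming whitespace-separated toy documents and a hypothetical `max_features` cut-off that keeps only the most frequent words (the helper names are invented for illustration; in the notebook this job is done by scikit-learn's `CountVectorizer`):

```python
from collections import Counter

docs = [
    "tumor cell tumor resection",
    "heart valve heart failure",
    "tumor biopsy biblioklept",
]

def build_vocabulary(docs, max_features):
    # Count every word over the whole corpus
    totals = Counter(word for doc in docs for word in doc.split())
    # Rank by descending count; ties are broken alphabetically
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    # Keep the top words and store them in alphabetical order
    return sorted(word for word, _ in ranked[:max_features])

def to_vector(doc, vocabulary):
    # Count only the words that are in the vocabulary
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vocabulary = build_vocabulary(docs, max_features=2)
matrix = [to_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(matrix)
```

With a vocabulary of only two words, the rare word "biblioklept" has no column and leaves no trace in the third document's vector.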

**K-Means**

K-Means is an algorithm that groups similar data points into clusters by iteratively adjusting cluster centers based on data point assignments. It's used for unsupervised clustering tasks like customer segmentation and data grouping.
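
The iterative procedure can be sketched as follows. This is a bare-bones version of the standard assignment/update loop on 2-D points, assuming hand-picked initial centers; scikit-learn's `KMeans` with `init="k-means++"` instead chooses well-spread starting centers automatically. The function name and toy points are made up for illustration.

```python
def k_means(points, centers, iterations=10):
    """Assignment/update loop on 2-D points, starting from given centers."""
    labels = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center
        clusters = [[] for _ in centers]
        labels = []
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            idx = dists.index(min(dists))
            labels.append(idx)
            clusters[idx].append(p)
        # Update step: move each center to the mean of its members
        new_centers = []
        for c, members in zip(centers, clusters):
            if members:
                new_centers.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centers.append(c)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return labels, centers

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(labels, centers)
```

The two obvious groups of points end up in separate clusters, and each center settles at the mean of its group.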

In conclusion, after upgrading the cleaning function and tuning the hyperparameters of the model, the terms grouped together became closer in meaning and the resulting clusters were better. Bag of Words was added as a second feature-extraction method, and the clustering model was changed from NMF to K-means, which groups documents by distance, to evaluate the effect of these factors. TF-IDF gave better results than Bag of Words: its logarithmic inverse-document-frequency weighting selects more informative features, whereas Bag of Words relies on raw word counts alone.

**References:**
1. https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html
2. https://towardsdatascience.com/how-to-cluster-similar-sentences-using-tf-idf-and-graph-partitioning-in-python-5fb70627a4e0
3. https://www.mygreatlearning.com/blog/bag-of-words/
4. https://www.geeksforgeeks.org/k-means-clustering-introduction/

**More information:**
Shiva and I worked together.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.feature_extraction import text
from nltk.corpus import stopwords
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
import nltk
import re
import string
import glob
import pandas as pd
from pathlib import Path

Load text files

In [16]:

#Create an empty DataFrame called 'df' with columns 'docid' and 'text'
df = pd.DataFrame(columns=['docid','text'])

# Get all files ending with '.txt' in the 'data' directory
docs = [x for x in glob.glob("data/*.txt")]

#Fill the DataFrame with data from the files
for doc in docs:
    txt = Path(doc).read_text(encoding="utf8")
    df.loc[len(df.index)] = [doc[:-4], txt]
      
df = df.set_index('docid')

In [17]:
df.head()

docid,text
data\15939911,CASE: A 28-year-old previously healthy man pre...
data\16778410,The patient was a 34-yr-old man who presented ...
data\17803823,A 23 year old white male with a 4 year history...
data\18236639,A 30-year-old female (65 kg) underwent rhinopl...
data\18258107,"Here, we describe another case in a 60-year-ol..."


## Cleaning the text

Optimize the cleaning function

In [18]:
def clean_text(text):
    # Trim leading and trailing whitespace
    text = text.strip()
    # Remove @mentions and URLs (http://, https://, www...)
    text = re.sub(r"(?:@|https?://|www)\S+", "", text)
    # Drop '#' and stray 'https:', and replace zero-width non-joiners and tabs with spaces
    text = text.replace("#", "").replace("\u200c", " ").replace("\t", " ").replace("https:", "")

    # Convert text to lowercase
    text = text.lower()

    # Remove square brackets and their contents
    text = re.sub(r'\[.*?\]', ' ', text)

    # Remove punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)

    # Remove words that contain digits
    text = re.sub(r'\w*\d\w*', ' ', text)

    # Remove the Unicode replacement character (U+FFFD)
    text = re.sub('\ufffd', ' ', text)

    return text

cleaned = lambda x: clean_text(x)


In [19]:
# Noun extraction and lemmatization function
def nouns(text):
    '''Given a string of text, tokenize the text,
    keep only the nouns, and lemmatize them.'''
    # Predicate that is True for noun POS tags (NN, NNS, NNP, NNPS)
    is_noun = lambda pos: pos[:2] == 'NN'
    # Split the string into a list of word tokens
    tokenized = word_tokenize(text)
    # Lemmatizer used to reduce each noun to its base form
    wordnet_lemmatizer = WordNetLemmatizer()
    # Tag each token with its part of speech, keep the nouns,
    # and lemmatize them
    all_nouns = [wordnet_lemmatizer.lemmatize(word)
                 for (word, pos) in pos_tag(tokenized) if is_noun(pos)]

    # Return the nouns joined into a single space-separated string
    return ' '.join(all_nouns)


In [20]:
# Clean Text
df["text"] = df["text"].apply(cleaned)
data_nouns = pd.DataFrame(df["text"].apply(nouns))
# Visually Inspect
data_nouns.head()

docid,text
data\15939911,case year man week history palpitation symptom...
data\16778410,patient man complaint fever cough smoker histo...
data\17803823,year male year history crohn disease day histo...
data\18236639,year female kg rhinoplasty anaesthesia combina...
data\18258107,case year man francisco pork philippine june m...


Improve the stop words

In [24]:
# Load the custom stop-word file (one word per line).
# The list is named 'Stopwords' because the vectorizer cells below refer to it by that name.
with open("stop_words.txt", "r", encoding="utf-8") as SW_file:
    Stopwords = SW_file.read().splitlines()



Optimize the hyperparameters

In [None]:
# ngram_range=(1, 2) includes two-word phrases (bigrams) as well as single words
# max_df=.8 drops terms that appear in more than 80% of the documents
# min_df=5 drops rare terms that appear in fewer than 5 documents

tv_noun = TfidfVectorizer(stop_words=Stopwords, ngram_range=(1, 2), max_df=.8, min_df=5)

# Fit and transform the noun text to a TF-IDF Doc-Term Matrix
data_tv_noun = tv_noun.fit_transform(data_nouns.text)
# Create data-frame of Doc-Term Matrix with nouns as column names
data_dtm_noun = pd.DataFrame(data_tv_noun.toarray(), columns=tv_noun.get_feature_names_out())
data_dtm_noun.index = df.index
# Visually inspect Document Term Matrix
data_dtm_noun.head()

docid,abdomen,abdomen pelvis,ablation,abnormality,absence,accumulation,acid,acidosis,activity,acuity,...,year male,year man,year patient,year surgery,year treatment,year woman,zone,μg,μl,μmol
data\15939911,0.0,0.0,0.496285,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.063863,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
data\16778410,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
data\17803823,0.080591,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.097003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
data\18236639,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.334607,0.0,0.0
data\18258107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.065868,0.0,0.0,0.0,0.0,0.0,0.0,0.127966,0.0


In [26]:
def display_topics(model, feature_names, num_top_words, topic_names=None):
    '''Given an NMF model, feature_names, and number of top words, print 
       topic number and its top feature names, up to specified number of top words.'''
    # iterate through topics in topic-term matrix, 'H' aka
    # model.components_
    for ix, topic in enumerate(model.components_):
        #print topic, topic number, and top words
        if not topic_names or not topic_names[ix]:
            print("\nTopic ", ix)
        else:
            print("\nTopic: '",topic_names[ix],"'")
        print(", ".join([feature_names[i] \
             for i in topic.argsort()[:-num_top_words - 1:-1]]))

Optimize the number of clusters

In [None]:
nmf_model = NMF(11)
# Learn an NMF model for given Document Term Matrix 'V' 
# Extract the document-topic matrix 'W'
doc_topic = nmf_model.fit_transform(data_dtm_noun)
# Extract top words from the topic-term matrix 'H' 
display_topics(nmf_model, tv_noun.get_feature_names_out(), 15)


Topic  0
heart, day, pressure, blood, hour, blood pressure, ejection fraction, ejection, function, ml, failure, rate, fraction, level, tachycardia

Topic  1
tumor, cell, lymph, lesion, tumor cell, node, lymph node, metastasis, fig, cm, nodule, mass, resection, ml, tomography

Topic  2
valve, echocardiography, leaflet, atrium, regurgitation, bypass, suture, tee, ventricle, ablation, artery, aorta, defect, murmur, failure

Topic  3
age, age year, parent, year age, muscle, month, gait, mri, seizure, brain, activity, child, level, gene, week

Topic  4
figure, cell, pain, vein, cm, carcinoma, examination, tumor, figure figure, malignancy, biopsy, muscle, wall, figure patient, sign

Topic  5
lung, day, chest, treatment, fig, culture, therapy, hospital, tuberculosis, respiratory, month, dyspnea, transplantation, effusion, sputum

Topic  6
mass, duct, cm, ct, tumour, fig, lesion, liver, carcinoma, examination, resection, abdomen, cyst, pain, wall

Topic  7
dl, mg, mg dl, level, platelet, coun

Bag of Words is another way to extract features from the text

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
CountVec = CountVectorizer(ngram_range=(1,1), stop_words=Stopwords)

# Use separate names for the Bag of Words matrix so that the TF-IDF matrix
# (data_dtm_noun), which is clustered with K-means further down, is not overwritten
data_cv_noun = CountVec.fit_transform(data_nouns.text)
# Create data-frame of Doc-Term Matrix with nouns as column names
data_dtm_bow = pd.DataFrame(data_cv_noun.toarray(), columns=CountVec.get_feature_names_out())
data_dtm_bow.index = df.index
# Visually inspect Document Term Matrix
data_dtm_bow.head()


docid,abbott,abdomen,abdominal,abdominis,abdomino,abdominopelvic,ablation,abnormal,abnormality,abr,...,µg,µl,µmol,µv,μg,μiu,μkat,μl,μm,μmol
data\15939911,0,0,0,0,0,0,4,0,0,0,...,0,0,0,0,0,0,0,0,0,0
data\16778410,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
data\17803823,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
data\18236639,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,3,0,0,0,0,0
data\18258107,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


We enhanced the data cleaning process with a more comprehensive cleaning function and an expanded stop-word list. We also tuned the parameters of the TF-IDF vectorizer (n-gram range, minimum and maximum document frequency), adjusted the number of clusters, and applied K-means clustering with k-means++ initialization. In addition to TF-IDF, we used Bag of Words as a second feature-extraction method. Together, these changes produced clusters whose top terms are more coherent in meaning.
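
The notebook does not record how the number of clusters was chosen. One common option, shown here purely as an assumption and not as the procedure actually used, is to fit K-means for several values of k and compare silhouette scores. The sketch runs on synthetic 2-D data (three made-up blobs), not on the clinical documents.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs (illustration only)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(30, 2))
    for center in ((0, 0), (5, 5), (0, 5))
])

# Fit K-means for a range of k and record the silhouette score of each fit
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    labels = model.fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest silhouette score is the candidate to inspect first
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On real text data the scores are usually much closer together, so the top terms per cluster should still be inspected by eye before settling on a value.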

Clustering with the K-means algorithm

In [None]:
from sklearn.cluster import KMeans

# Specify the number of clusters
num_clusters = 7

# Create an instance of the KMeans clustering algorithm
kmeans = KMeans(n_clusters=num_clusters, init="k-means++")

# Fit the KMeans algorithm to the data
kmeans.fit(data_dtm_noun)

# Get the cluster labels for each data point
clusters = kmeans.labels_.tolist()



In [None]:
print("Top terms per cluster:")

# Get the indices that would sort the cluster centers
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]

# Get the feature names from the TfidfVectorizer
terms = tv_noun.get_feature_names_out()

# Iterate over each cluster
for i in range(num_clusters):
    print("Cluster %d:" % i)
    
    # Iterate over the indices of the top terms in the current cluster
    for ind in order_centroids[i, :40]:
        print(' %s' % terms[ind], end=',')
    
    # Print a new line after printing the top terms for the current cluster
    print("\n")


Top terms per cluster:
Cluster 0:
 rash, medication, face, supplementation, diagnosis, deficiency, injection, heart, platelet, drug, day, prednisolone, month, blood, week, history, neck, steroid, fistula, thrombocytopenia, count, skin, age year, chest, arm, figure, dl, mg, state, examination, presentation, autoimmune, test, dose, abdomen, improvement, echocardiogram, tube, hand, pain,

Cluster 1:
 tumor, mass, cell, cm, figure, lesion, fig, examination, resection, lymph, ct, carcinoma, metastasis, tumor cell, month, nodule, node, biopsy, ml, surgery, dl, lymph node, diagnosis, level, size, tomography, history, specimen, cancer, day, duct, disease, abdomen, eu, wall, grade, invasion, diameter, tissue, calcium,

Cluster 2:
 age, muscle, year age, parent, acidosis, birth, gene, child, analysis, liver, mri, lactate, seizure, acid, blood, age year, hour, pregnancy, section, care, month, activity, blocker, pressure, investigation, brain, blood pressure, matter, syndrome, cord, sequencing, va