## PART 1: PREPARING DATA 

Before we begin the process of text summarization, we must make the text to be summarized easily accessible.

In [1]:
text=''' “To be ‘feminist’ in any authentic sense of the term is to want for all people, female and male, liberation from sexist role patterns, domination, and oppression.” bell hooks made this clear and powerful statement in her 1981 study of sexism, racism, and the feminist and civil rights movements Ain't I a Woman: Black Women and Feminism. Almost 40 years on, the world is still reckoning with pervasive and inexcusable gender inequality underpinned by bias and sexism, and research and health care are no exception. Today, The Lancet publishes a theme issue on advancing women in science, medicine, and global health, with the aim of showcasing research, commentary, and analysis that provide new explanations and evidence for action towards gender equity. This theme issue is the result of a call for papers that led to over 300 submissions from more than 40 countries. The overwhelming conclusion from this collection of work is that, to achieve meaningful change, actions must be directed at transforming the systems that women work within—making approaches informed by feminist analyses essential.
It is well established that women are under-represented in positions of power and leadership, undervalued, and experience discrimination and gender-based violence in scientific and health disciplines across the world. Intersectional approaches have provided insights into how other categories of difference such as ethnicity, class, geography, disability, and sexuality interact with gender to compound inequalities. Most submissions to this theme issue came from high-income countries, highlighting the need to support scholarship from the Global South. Geordan Shannon and colleagues provide a global overview of gender inequality in science, medicine, and global health, and discuss the evidence for the substantial health, social, and economic gains that could be achieved by addressing this inequality. Indeed, some studies, including one in this issue by Cassidy Sugimoto and colleagues, show that more diverse and inclusive teams lead to better science and more successful organisations.
Despite decades of recognition, these problems have proved stubbornly persistent. It is now commonplace for organisations to make public statements valuing diversity, hire diversity officers, and implement programmes to advance women's careers. Yet, all too often, such programmes locate the source of the problem, and hence the solution, within women and their own behaviour. Thus, although actions such as mentoring and skills training might be well intentioned and advantageous to a degree, they often fail to engage with broader features of systems that disproportionately privilege men. For instance, Holly Witteman and colleagues show, using data from a federal funder, how gender bias disadvantages women applying for grant funding.
Reflecting on these biases can be difficult for professions like science and medicine that are grounded in beliefs of their own objectivity and evidence-driven thinking. A trio of papers in this issue demonstrates the value of critical perspectives in this regard. Malika Sharma explains how the “historical gendering of medicine prioritises particular types of knowledge (and ways of producing that knowledge), and creates barriers for critical, and specifically feminist, research and practice”. Feminist and other critical perspectives enable researchers to question the underlying assumptions that produce and maintain social hierarchies, and in doing so, imagine ways to transform fields and practices to make them more equitable and inclusive. Likewise, Sara Davies and colleagues argue that a feminist research agenda is key to advancing gender equality in global health, and Kopano Ratele and colleagues explain why efforts to engage men in advancing gender equality must be grounded in an appreciation of theories of masculinity.
For actions to have lasting and far-reaching consequences, they must therefore be directed at creating institutional-level change. Several pieces in this theme issue discuss such approaches, with a Review by Imogen Coe and colleagues providing a toolbox of organisational best practices towards gender equality in science and medicine. The Lancet's commitments to addressing gender bias in publishing are detailed in a Comment. Gender equity is not only a matter of justice and rights, it is crucial for producing the best research and providing the best care to patients. If the fields of science, medicine, and global health are to hope to work towards improving human lives, they must be representative of the societies they serve. The fight for gender equity is everyone's responsibility, and this means that feminism, too, is for everybody—for men and women, researchers, clinicians, funders, institutional leaders, and, yes, even for medical journals.
'''

In [2]:
print(text)

 “To be ‘feminist’ in any authentic sense of the term is to want for all people, female and male, liberation from sexist role patterns, domination, and oppression.” bell hooks made this clear and powerful statement in her 1981 study of sexism, racism, and the feminist and civil rights movements Ain't I a Woman: Black Women and Feminism. Almost 40 years on, the world is still reckoning with pervasive and inexcusable gender inequality underpinned by bias and sexism, and research and health care are no exception. Today, The Lancet publishes a theme issue on advancing women in science, medicine, and global health, with the aim of showcasing research, commentary, and analysis that provide new explanations and evidence for action towards gender equity. This theme issue is the result of a call for papers that led to over 300 submissions from more than 40 countries. The overwhelming conclusion from this collection of work is that, to achieve meaningful change, actions must be directed at trans

## PART 2: SUMMARIZATION USING THE TF-IDF ALGORITHM

In [3]:
#Importing all the necessary libraries
import math
import nltk
from nltk import sent_tokenize, word_tokenize, PorterStemmer
from nltk.corpus import stopwords
#Uncomment on the first run to download the tokenizer models and the stop-word list
#nltk.download('punkt')
#nltk.download('stopwords')

### Creating Functions: First, we'll focus on defining the functions needed to execute the summarization via TF-IDF.

In [4]:
#1:A function to create frequency tables 
def CreateFrequencyTable(text) -> dict:
    stopWords=set(stopwords.words("english"))
    words=word_tokenize(text)
    ps=PorterStemmer()
    freqTable=dict()
    for word in words:
        word=ps.stem(word)
        if word in stopWords:
            continue
        if word in freqTable:
            freqTable[word]+=1
        else:
            freqTable[word]=1
    return freqTable

In [5]:
#2:A function to build a word-frequency table for each sentence of the text to be summarized. Each table is keyed by the first 15 characters of its sentence.
def CreateFrequencyMatrix(sentences):
    frequency_matrix={}
    stopWords=set(stopwords.words("english"))
    ps=PorterStemmer()
    for sent in sentences:
        freq_table={}
        words=word_tokenize(sent)
        for word in words:
            word=word.lower()
            word=ps.stem(word)
            if word in stopWords:
                continue
            if word in freq_table:
                freq_table[word]+=1
            else:
                freq_table[word]=1
        frequency_matrix[sent[:15]]=freq_table
    return frequency_matrix

In [6]:
#3: Defining a function which can calculate the term frequency 
def CreateTFmatrix(freq_matrix):
    tf_matrix={}
    for sent, f_table in freq_matrix.items():
        tf_table={}
        #Note: this is the number of distinct words in the sentence, not its total word count
        count_words_in_sentence=len(f_table)
        for word, count in f_table.items():
            tf_table[word]=count/count_words_in_sentence
        tf_matrix[sent]=tf_table
    return tf_matrix

In [7]:
#4: Defining a function which can calculate the number of sentences that contain a word
def SentPerWord(freq_matrix):
    word_per_doc_table = {}
    for sent, f_table in freq_matrix.items():
        for word, count in f_table.items():
            if word in word_per_doc_table:
                word_per_doc_table[word]+=1
            else:
                word_per_doc_table[word]=1
    return word_per_doc_table

In [8]:
#5:Defining a function which calculates the IDF and gives a measure of how rarely a word occurs
def CreateIDFmatrix(freq_matrix, count_doc_per_words, total_documents):
    idf_matrix={}
    for sent, f_table in freq_matrix.items():
        idf_table={}
        for word in f_table.keys():
            idf_table[word]=math.log10(total_documents/float(count_doc_per_words[word]))
        idf_matrix[sent]=idf_table
    return idf_matrix
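
Written as formulas, the two functions above compute the following. The symbols are introduced here for illustration only and do not appear in the notebook: $f_{w,s}$ is the count of word $w$ in sentence $s$, $U_s$ is the number of distinct (stemmed) words kept for $s$, $N$ is the number of sentences, and $n_w$ is the number of sentences that contain $w$.

$$
\mathrm{tf}(w,s) = \frac{f_{w,s}}{U_s}, \qquad \mathrm{idf}(w) = \log_{10}\frac{N}{n_w}
$$

A word that occurs in every sentence has $n_w = N$, so its IDF is $\log_{10} 1 = 0$.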

In [9]:
#6:Finally, we define a function which generates a TFIDF matrix
def CreateTFIDFmatrix(tf_matrix, idf_matrix):
    tf_idf_matrix={}
    for (sent1, f_table1), (sent2, f_table2) in zip(tf_matrix.items(), idf_matrix.items()):
        tf_idf_table={}
        #Both tables were built from the same frequency table, so their keys are the same and in the same order
        for (word1, value1), (word2, value2) in zip(f_table1.items(), f_table2.items()):
            tf_idf_table[word1]=float(value1 * value2)
        tf_idf_matrix[sent1]=tf_idf_table
    return tf_idf_matrix
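
To make the TF, IDF and TF-IDF steps concrete, here is a minimal, self-contained sketch on a toy input. It is not the notebook's pipeline: the sentences are made-up and pre-tokenized, there is no stemming or stop-word removal, and the helper name `tf_idf_per_sentence` is hypothetical. It does follow the same arithmetic as the functions above, including dividing by the number of distinct words in a sentence.

```python
import math

# Toy data (hypothetical): each sentence is already split into lowercase words.
sentences = [
    ["cats", "chase", "mice"],
    ["dogs", "chase", "cats"],
    ["mice", "eat", "cheese", "cheese"],
]

def tf_idf_per_sentence(tokenized):
    """Return one {word: tf-idf weight} table per sentence."""
    n_sentences = len(tokenized)
    # Per-sentence word counts.
    freq = [{w: s.count(w) for w in set(s)} for s in tokenized]
    # Number of sentences that contain each word.
    sent_per_word = {}
    for table in freq:
        for w in table:
            sent_per_word[w] = sent_per_word.get(w, 0) + 1
    weights = []
    for table in freq:
        distinct = len(table)  # same denominator as the notebook's TF step
        weights.append({
            w: (count / distinct) * math.log10(n_sentences / sent_per_word[w])
            for w, count in table.items()
        })
    return weights

weights = tf_idf_per_sentence(sentences)
for sent, table in zip(sentences, weights):
    print(" ".join(sent), {w: round(v, 3) for w, v in sorted(table.items())})
```

In this toy input, "cheese" gets the largest weight because it is repeated within one sentence and appears in no other, while words shared by two of the three sentences ("cats", "chase", "mice") get small weights.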

In [10]:
#7:Scoring sentences using the obtained TF-IDF Matrix:
def ScoreSent(tf_idf_matrix) -> dict:
    sentenceValue={}
    for sent, f_table in tf_idf_matrix.items():
        total_score_per_sentence=0
        count_words_in_sentence=len(f_table)
        for word, score in f_table.items():
            total_score_per_sentence+=score
        sentenceValue[sent]=total_score_per_sentence/count_words_in_sentence
    return sentenceValue

In [11]:
#8:Defining a function which averages the sentence scores from the function above; this average serves as the baseline for the summary threshold.
def FindAvgScore(sentenceValue) -> float:
    sumValues=0
    for entry in sentenceValue:
        sumValues+=sentenceValue[entry]
    average=(sumValues / len(sentenceValue))
    return average

In [12]:
#9:A function to generate the summary 
def GenerateSummary(sentences, sentenceValue, threshold):
    summary=''
    for sentence in sentences:
        #Scores are looked up by the first 15 characters of each sentence, matching CreateFrequencyMatrix
        if sentence[:15] in sentenceValue and sentenceValue[sentence[:15]]>=(threshold):
            summary+=" "+sentence
    return summary
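
One caveat about keying sentences by their first 15 characters: two sentences that share the same opening will map to the same key, so one overwrites the other. The sketch below illustrates this with two made-up sentences and shows keying by position as one possible alternative. It is an illustration only, not a change to the functions above.

```python
# Two hypothetical sentences that share their first 15 characters.
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown bear sleeps all winter.",
]

# Keying by a 15-character prefix: the second sentence overwrites the first.
by_prefix = {s[:15]: s for s in sentences}

# Keying by position: every sentence keeps its own entry.
by_index = {i: s for i, s in enumerate(sentences)}

print(len(by_prefix))  # 1
print(len(by_index))   # 2
```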

### Calling Functions: Now that we have defined all the required functions, we use them on the text which was initialised earlier. For the sake of reusability, we define the order of the function calls in another function.

In [13]:
#Now, we use all the functions above to generate the summary we require. 
def run_summarization(text):
    
    #Step1:Tokenizing the Sentences
    Sentences=sent_tokenize(text)
    TotalDocs=len(Sentences)
    
    #Step2:Creating the Frequency matrix of the words in each sentence.
    FreqMatrix=CreateFrequencyMatrix(Sentences)

    #Step3:Calculating Term Frequency and generating a matrix which indicates how frequently a word occurs
    TFmatrix=CreateTFmatrix(FreqMatrix)

    #Step4:Determining how many sentences contain a particular word 
    SPW=SentPerWord(FreqMatrix)

    #Step5:Generating an IDF matrix which will reflect some of the rarely occurring words.
    IDFmatrix=CreateIDFmatrix(FreqMatrix,SPW,TotalDocs)

    #Step6:Calculating TF-IDF matrix
    TFIDFM=CreateTFIDFmatrix(TFmatrix,IDFmatrix)

    #Step7:Scoring the tokenized sentences
    SentScore=ScoreSent(TFIDFM)

    #Step8:Using the above score, find the threshold appropriate to generate the summary.
    Threshold=FindAvgScore(SentScore)
    #print(Threshold)

    #Step9:Generating the Summary
    Summary=GenerateSummary(Sentences,SentScore,1.3*Threshold)
    return Summary
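
The effect of the threshold multiplier used in Step 9 can be seen on a small example. The scores below are made-up numbers, not values produced by this notebook, and `select_above` is a hypothetical helper that applies the same `>=` comparison as `GenerateSummary`.

```python
# Hypothetical sentence scores, chosen only to make the arithmetic easy to follow.
scores = {"s1": 1.0, "s2": 2.0, "s3": 3.0, "s4": 4.0}
average = sum(scores.values()) / len(scores)  # 2.5

def select_above(scores, average, multiplier):
    """Keep the sentences whose score is at least multiplier * average."""
    threshold = multiplier * average
    return [name for name, value in scores.items() if value >= threshold]

print(select_above(scores, average, 1.0))  # ['s3', 's4']
print(select_above(scores, average, 1.3))  # ['s4']
```

A larger multiplier keeps fewer sentences and so gives a shorter summary; a smaller one gives a longer summary.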

In [14]:
#Step10:Calling the function which generates the summary
if __name__ == '__main__':
    result=run_summarization(text)
    print(result)

 This theme issue is the result of a call for papers that led to over 300 submissions from more than 40 countries. Despite decades of recognition, these problems have proved stubbornly persistent. Yet, all too often, such programmes locate the source of the problem, and hence the solution, within women and their own behaviour. Reflecting on these biases can be difficult for professions like science and medicine that are grounded in beliefs of their own objectivity and evidence-driven thinking. A trio of papers in this issue demonstrates the value of critical perspectives in this regard. For actions to have lasting and far-reaching consequences, they must therefore be directed at creating institutional-level change. The Lancet's commitments to addressing gender bias in publishing are detailed in a Comment.


### ANALYSIS:

The summary is coherent, contextually relevant and informative. However, it is also vague: the article is about feminism, yet a reader could not tell that from the summary alone. Furthermore, the ending is rather abrupt.

## PART 3: SUMMARIZATION USING THE TEXTRANK ALGORITHM

The TextRank algorithm is based on the PageRank algorithm, which is used to rank web pages in search results. PageRank estimates the probability of a user going from one page to another, and TextRank works the same way, with sentences taking the place of web pages.
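
The idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used by any library: sentences are made-up and pre-tokenized, the edge weight between two sentences is a word-overlap measure normalised by the logarithms of the sentence lengths, and the function names, the damping factor of 0.85 and the fixed iteration count are choices made here for the example.

```python
import math

def similarity(a, b):
    """Word overlap between two tokenized sentences, normalised by their lengths."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # avoids log(1) + log(1) = 0 in the denominator
    overlap = len(set(a) & set(b))
    return overlap / (math.log(len(a)) + math.log(len(b)))

def textrank(tokenized, damping=0.85, iterations=50):
    """Score sentences by PageRank-style iteration over a similarity graph."""
    n = len(tokenized)
    # One node per sentence; edge weights are pairwise similarities.
    weights = [[similarity(tokenized[i], tokenized[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour passes on a share of its score, proportional to the edge weight.
            incoming = sum(weights[j][i] / out_sum[j] * scores[j]
                           for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Hypothetical toy input: the first two sentences share words, the third shares none.
sentences = [
    ["gender", "equity", "matters", "in", "science"],
    ["science", "needs", "gender", "equity"],
    ["the", "weather", "is", "nice", "today"],
]
scores = textrank(sentences)
ranking = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The two sentences that share vocabulary reinforce each other and end up ranked above the unrelated third sentence; a summary is then formed from the top-ranked sentences.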

### NOTE: The TextRank Algorithm has already been built into a summarize function available in 'gensim', an open source library for unsupervised topic modelling and natural language processing.

Since our focus is on comparing the text summarization techniques, we'll use this built-in function directly rather than implementing the TextRank algorithm ourselves.

In [16]:
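#Importing the necessary libraries
#Note: the gensim.summarization module was removed in gensim 4.0, so this cell needs an earlier gensim version
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords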

In [20]:
print(summarize(text,word_count=125))

Today, The Lancet publishes a theme issue on advancing women in science, medicine, and global health, with the aim of showcasing research, commentary, and analysis that provide new explanations and evidence for action towards gender equity.
The overwhelming conclusion from this collection of work is that, to achieve meaningful change, actions must be directed at transforming the systems that women work within—making approaches informed by feminist analyses essential.
Geordan Shannon and colleagues provide a global overview of gender inequality in science, medicine, and global health, and discuss the evidence for the substantial health, social, and economic gains that could be achieved by addressing this inequality.
Several pieces in this theme issue discuss such approaches, with a Review by Imogen Coe and colleagues providing a toolbox of organisational best practices towards gender equality in science and medicine.


### ANALYSIS: 

The above summary is not only coherent, grammatically correct, contextually relevant and informative, but also clear. It is evident to any reader that the article focuses on feminism. The summary is also well structured.

## PART 4: COMPARISON OF OBTAINED SUMMARIES

### On the Basis of Documents:

Both algorithms have been used for single-document summarization here. However, it is noteworthy that the TextRank algorithm can be, and often is, used for multi-document summarization. In contrast, the TF-IDF algorithm is mostly used for single-document summarization.

### On the Basis of Nature:

Both algorithms are extractive in nature. This means that both techniques find the essential sentences of the original text and build the summary from them verbatim, rather than generating new wording.

### On the Basis of Discourse:

The TextRank algorithm is a graph-based approach, whereas the TF-IDF algorithm is a frequency-based approach that relies almost entirely on word-count statistics.

### Text Quality:

Both summaries are grammatically accurate and coherent. They are informative and non-redundant. The summary obtained through the TextRank algorithm is more structured and thus conveys more information to the reader. It is also clearer than the TF-IDF summary.

### Nature of Summary: 

While the TF-IDF summary provides clear statistical insights, the TextRank summary gives a contextually clear and focused picture of the topic being discussed. After reading the TextRank summary, it is immediately evident that the point of discussion is feminism. This is not the case for the TF-IDF summary.

### Conclusive Comparison:

The TextRank summary, being well structured and coherent, is a general-purpose one, useful for all readers. On the other hand, the TF-IDF summary will be useful for readers who are topic-oriented, i.e., those who are looking for information on feminism.

### Link for Text used as dataset here:


https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(19)30239-9/fulltext