In [26]:
import nltk
from nltk.tokenize import sent_tokenize
import wikipediaapi

In [27]:
# Function to fetch content from Wikipedia for a given subject
def fetch_wikipedia_content(subject):
    wiki_wiki = wikipediaapi.Wikipedia('english')
    page_py = wiki_wiki.page(subject)

    if not page_py.exists():
        print(f"Wikipedia page for '{subject}' does not exist.")
        return None

    return page_py.text


In [28]:
# Function to save content to a file
def save_to_file(content, filename):
    with open(filename, 'w', encoding='utf-8') as file:
        file.write(content)


In [29]:
# Function to read content from a file
def read_from_file(filename):
    with open(filename, 'r', encoding='utf-8', errors='ignore') as file:
        return file.read()

In [30]:
# Function to summarize a text within the context window
def summarize_text(text, target_length):
    summary = ""
    sentences = sent_tokenize(text)
    length = 0
    for sentence in sentences:
        if length + len(sentence.split()) <= target_length:
            summary += sentence + " "
            length += len(sentence.split())
        else:
            break
    return summary.strip()

In [31]:
def perform_summarization(content, output_filename, context_window_limit):
    length = len(content.split())
    target_length = int(length * (context_window_limit / (context_window_limit + 4000)))

    summary_list = []
    start_index = 0

    while start_index < length:
        slice_content = content[start_index:]
        summary_slice = summarize_text(slice_content, context_window_limit)
        summary_list.append(summary_slice)
        start_index += len(summary_slice.split())

    collated_summary = " ".join(summary_list)

    while len(collated_summary.split()) > context_window_limit:
        collated_summary = summarize_text(collated_summary, context_window_limit)

    with open(output_filename, "w") as file:
        file.write(collated_summary)

    return collated_summary

In [32]:
# Main function
def main():
    # Take input from the user for the Wikipedia subjects
    subject1 = input("Enter the first subject: ")
    subject2 = input("Enter the second subject: ")

    # Fetch content from Wikipedia
    content1 = fetch_wikipedia_content(subject1)
    content2 = fetch_wikipedia_content(subject2)

    if content1 and content2:
        # Save content to input files
        save_to_file(content1, "input_text1.txt")
        save_to_file(content2, "input_text2.txt")

        # Summarize text files
        contents_1 = read_from_file("input_text1.txt")
        contents_2 = read_from_file("input_text2.txt")

        perform_summarization(contents_1, "summarized_text1.txt", 128)
        perform_summarization(contents_2, "summarized_text2.txt", 128)

        # Generate the query
        query = f"\nDocument 1 summary: {read_from_file('summarized_text1.txt')}\n\nDocument 2 summary: {read_from_file('summarized_text2.txt')}"

        print(query)


In [33]:
if __name__ == "__main__":
    main()

Enter the first subject: Natural Language Processing
Enter the second subject: Artificial Intelligence

Document 1 summary: Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves. Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation. History
N

Adding cosine similarity for enhancing the quality and relevance of the generated summary by considering the relationships between sentences.

In [34]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [35]:
def compute_similarity_matrix(sentences):
    vectorizer = CountVectorizer().fit_transform(sentences)
    similarity_matrix = cosine_similarity(vectorizer)
    return similarity_matrix

In [39]:
# Function to summarize a text within the context window using cosine similarity
def summarize_text_cosine_similarity(text, target_length):
    sentences = sent_tokenize(text)
    length = 0
    summary = ""
    
    # Compute cosine similarity matrix between sentences
    similarity_matrix = compute_similarity_matrix(sentences)
    
    while length < target_length and length < len(sentences):
        # Find the sentence most similar to the existing summary
        most_similar_index = similarity_matrix[length].argsort()[-1]
        
        # Add the selected sentence to the summary
        summary += sentences[most_similar_index] + " "
        
        # Update the length based on the added sentence
        length += len(sentences[most_similar_index].split())

        # Set the similarity scores of the added sentence to 0 to avoid repetition
        similarity_matrix[:, most_similar_index] = 0

    return summary.strip()

In [40]:
def perform_summarization(content, output_filename, context_window_limit):
    length = len(content.split())
    target_length = int(length * (context_window_limit / (context_window_limit + 4000)))

    summary = summarize_text_cosine_similarity(content, target_length)

    with open(output_filename, "w") as file:
        file.write(summary)

    return summary


In [42]:
def main():

        contents_1 = read_from_file("input_text1.txt")
        contents_2 = read_from_file("input_text2.txt")

        perform_summarization(contents_1, "Second_summarized_text1.txt", 128)
        perform_summarization(contents_2, "Second_summarized_text2.txt", 128)

        # Generate the query
        query = f"\nDocument 1 summary: {read_from_file('Second_summarized_text1.txt')}\n" \
                f"\nDocument 2 summary: {read_from_file('Second_summarized_text2.txt')}"

        print(query)


if __name__ == "__main__":
    main()


Document 1 summary: Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. containing words or structures that have not been seen before) and erroneous input (e.g. This is the opposite of text to speech and is one of the extremely difficult problems colloquially termed "AI-complete" (see above). Syntactic analysis
Grammar induction
Generate a formal grammar that describes a language's syntax. Furthermore, many other languages in non-Western scripts (e.g. from a dictionary or an online resource such as WordNet. The more general task of coreference resolution also includes identifying so-called "bridging relationships" involving referring expressions. It has thus been subject to a number of shared tasks since 2011.

Document 2 s