# **Table Summarization**

The test question is as follows: 

Using any of the available language model via any open-source platform, can you try summarizing the following table to explain the values in a human readable format in either of the following ways:

a. Column wise <br>
b. Row wise <br>
c. Complete table summarization

---
## Solution Approch (Using Column-wise):
1. Creating a sentence for each datapoint, utilized the column and row descriptions of the data.

```
    For example: For the data at (0, 0): 
            sentence = January 20.7 Average Temperature °C
```

2. Using different approaches to converting the above sentence into meaningful sentences: 

> In order to generate more meaningful variations of the sentence mentioned above, I employ three approaches. All of these approaches are based on a common basic model, i.e. Text2Text model:


        a. Direct Text2Text: directly to generate alternative sentences by utilizing the Text2Text generation pipeline
        b. Bag Of Words: Adding some few extra words to the sentence for introducing variations and then feeding it into the text2text model
        c. Masking: Using Masking pipelines to add meaningful words to the original sentence then passed through the Text2Text generation model to generate alternative versions
           By masking specific parts of the sentence and replacing them with words can provide more context or specificity  


3. Making a new sentence by combining the generated sentences from the Text2Text model column-wise 


> After generating alternative sentences for each datapoint, I combine the generated sentences column-wise. By aligning the sentences corresponding to each column, I create a new sentence that incorporates information from all the columns. This combined sentence provides a comprehensive representation of the data, highlighting the relationship between different column values


4. Then utilizing summarizing pipelines to summarize the whole combined sentence:


> To summarize the entire combined sentence, I employ Summarizing pipelines. By the combined sentence and extracting the most important details, I generate a summarized version that captures the key aspects of the original information




---





# Installing required dependencies

In [1]:
# Install the transformers and datasets packages
!pip install transformers datasets

# Install the sentencepiece package
!pip install sentencepiece

# Install the transformers package with the sentencepiece extra
!pip install transformers[sentencepiece]


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
# Import the necessary library
import pandas as pd

# Define the column names
column_name = [" ", "January", "Februray", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]

# Define the data as a list of lists
data = [["Average Temperature °C", 20.7, 22.8, 25.4, 26.6, 25.5, 23.2, 22.4, 22.2, 22.4, 22.0, 20.9, 20.1],
        ["Lowest Temperature °C", 14.6, 16.1, 18.6, 20.8, 21.0, 20.2, 19.8, 19.4, 19.0, 18.4, 16.7, 15.1],
        ["Maximum Temperature °C", 27.4, 29.6, 32.1, 32.8, 31.2, 27.5, 26.4, 26.1, 26.7, 26.4, 25.7, 25.8],
        ["Rainfall mm", 4.0, 7.0, 16.0, 45.0, 131.0, 126.0, 134.0, 137.0, 125.0, 147.0, 65.0, 23.0],
        ["average Sunlight per day hour", 8.9, 9.8, 10.4, 10.6, 9.5, 6.8, 6.1, 5.7, 6.4, 7.2, 7.1, 7.5]]

# Creating a DataFrame 
df = pd.DataFrame(data, columns=column_name)
df

Unnamed: 0,Unnamed: 1,January,Februray,March,April,May,June,July,August,September,October,November,December
0,Average Temperature °C,20.7,22.8,25.4,26.6,25.5,23.2,22.4,22.2,22.4,22.0,20.9,20.1
1,Lowest Temperature °C,14.6,16.1,18.6,20.8,21.0,20.2,19.8,19.4,19.0,18.4,16.7,15.1
2,Maximum Temperature °C,27.4,29.6,32.1,32.8,31.2,27.5,26.4,26.1,26.7,26.4,25.7,25.8
3,Rainfall mm,4.0,7.0,16.0,45.0,131.0,126.0,134.0,137.0,125.0,147.0,65.0,23.0
4,average Sunlight per day hour,8.9,9.8,10.4,10.6,9.5,6.8,6.1,5.7,6.4,7.2,7.1,7.5


In [3]:
# Calculate the number of columns and rows of the dataset
columns = len(column_name)
rows = len(data)
print (rows, columns)

5 13


### 1. Creating a sentence for each datapoint, utilized the column and row descriptions of the data



In [4]:
# creating a list of sentences based on the DataFrame. Each sentence follows the format: Column name + Data value + Unit of data point + Row name. 
sentences = []

for col in range(1, columns):
    sen = []
    for row in range(rows):     
        # format of sentence: Column_name + datavalue + Unit of datapoint + Row_name 
        text = column_name[col] + " " + str(data[row][col]) + str(data[row][0].split()[-1]) + " " + " ". join(data[row][0].split()[:-1])
        sen.append(text)
    sentences.append(sen)

    
sentences[0]

['January 20.7°C Average Temperature',
 'January 14.6°C Lowest Temperature',
 'January 27.4°C Maximum Temperature',
 'January 4.0mm Rainfall',
 'January 8.9hour average Sunlight per day']

# Importing necessary models 

### a. Text2Text Model

In [5]:
from transformers import AutoModelWithLMHead, AutoTokenizer

# specify the model here
model_name = "mrm8488/t5-base-finetuned-common_gen" 
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelWithLMHead.from_pretrained(model_name)

def gen_sentence(words, max_length=500):
  input_text = words
  features = tokenizer([input_text], return_tensors='pt')

  output = model.generate(input_ids=features['input_ids'], 
               attention_mask=features['attention_mask'],
               max_length=max_length)

  return tokenizer.decode(output[0], skip_special_tokens=True)



### b. Masking pipeline 

In [6]:
from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### c. Summarizer Pipeline

In [7]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

Downloading pytorch_model.bin:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

# Defining necessary functions

In [8]:
# This function takes a list of sentences as input and combines them into larger sentences
# This function uses another function gen_sentence (text2text model function) to generate a sentence from a given input sentence

def combine_sentence(sentences):
    to_summarize = []
    
    # Iterate over the list of sentences
    for sentence_list in sentences:
        tmp = ''
        
        # Iterate over the sentence in each sentence_list
        for sentence in sentence_list:
            # Generate a sentence using the gen_sentence function and append it to tmp
            tmp += gen_sentence(sentence) + " "
        
        # Append the combined sentence to the to_summarize list
        to_summarize.append(tmp)
    
    return to_summarize


In [9]:
# This functin performs text summarization on the input sentences using a summarizer model

def summarize(to_summarize, epochs=1):
    import copy 
    
    # Create a deep copy of the input list to avoid modifying the original list
    toSummarize = copy.deepcopy(to_summarize)

    # Perform summarization for the specified number of epochs
    for epoch in range(epochs):

        # Iterate over the sentences in the toSummarize list
        for idx, sentn in enumerate(toSummarize):

            # Generate a summarized text for the current sentence using the summarizer model
            summarized_text = summarizer(sentn, max_length=int(2*len(sentn.split())), min_length=int(len(sentn.split())*1.5), do_sample=False)
            
            # Extract the summarized text from the generated output
            for dictionary in summarized_text:
                for key, value in dictionary.items():
                    # Update the current sentence with the summarized text
                    toSummarize[idx] = value
    
    return toSummarize


# 2. Different approaches to converting the sentence into meaningful sentences:

## a. Approch 1:



> Direct Text2Text: Using directly Text2Text generation pipeline



In [10]:
to_summarize = combine_sentence(sentences)
to_summarize

['the average temperature in the city is 20.7°C on a sunny day in January lowest temperature in a month of January is 14.6°C. a smoky winter day with temperatures of 27°C and a high of 68°C a 4.0mm rainfall in the mountains in winter with a high angle of the mountains average of 8.9hours of sunlight per day in January ',
 'average temperature in the mountains of the region of Februray is 22.8°C. a smoky tidal lake with low temperatures of 16.1°C on a sunny day in the morning on the coast of the island a swan is seen in the distance as the temperature is 29.6°C at the beach a smoky 7.0mm of rainfall falls on the rocky coast of the island of Februray average of 9.8hours of sunlight per day on a foggy day ',
 'Average temperature in the month of march is 25.4°C lowest temperature in a month is 18.6°C. a smoky cloudy day with temperatures of 32.1°C in the morning on a sunny day in march a smoky cloudy day with a maximum of 16.0mm of rainfall on the first day of march average of 10.4 hours 

In [11]:
toSummarize = summarize(to_summarize, 1)

Your max_length is set to 120, but your input_length is only 83. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=41)
Your max_length is set to 150, but your input_length is only 107. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=53)
Your max_length is set to 120, but your input_length is only 83. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=41)
Your max_length is set to 126, but your input_length is only 86. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=43)
You

In [12]:
toSummarize

['The average temperature in the city is 20.7°C on a sunny day in January. lowest temperature in a month of January is 14.6°C. a smoky winter day with temperatures of 27°C and a high of 68°C a 4.0mm rainfall in the mountains in winter with a high angle of the mountains average of 8.9hours of sunlight per day inJanuary. The average rainfall in January is 4.3mm.',
 'Average temperature in the mountains of the region of Februray is 22.8°C. Average of 9.8hours of sunlight per day on a foggy day. 7.0mm of rainfall falls on the rocky coast of the island. Average temperature at the beach is 29.6°C on a sunny day in the morning. The average temperature at night is 16.1°C and the average temperature on a rainy day is 14.7°C in the region. The island has a population of around 2,000 people.',
 'Average temperature in the month of march is 25.4°C lowest temperature in a month is 18.6°C a smoky cloudy day with temperatures of 32.1°C in the morning on a sunny day in march. Average of 16.0mm of rain

---



---

## b. Approch 2


> Bag Of Words: Adding extra words to the original sentence before using the Text2Text model. This enhances the sentence and generates more meaningful variations.



In [13]:
def BoW(sentence):
    words = sentence.split()

    # Define the bag of words
    bag_of_words = ["is", "the", "a"]

    # Add bag of words to the words
    bag_of_words_with_words = words[1:] + bag_of_words

    import random
    random.shuffle(bag_of_words_with_words)

    # Convert the bag of words into a sentence or phrase
    input_text = words[0] + " " + " ".join(bag_of_words_with_words)

    return input_text


In [14]:
to_summarize = []

# Iterate over each sentences list
for sentence_list in sentences:
    tmp = ''
    # Iterate over each sentence in the sentence_list
    for sentence in sentence_list:

        # Apply the BoW function to the sentence and text2text function
        modified_sentence = gen_sentence(gen_sentence(BoW(sentence)))

        # Append the modified sentence to the temporary string
        tmp += modified_sentence + " "

    # Append the temporary string to the to_summarize list
    to_summarize.append(tmp)


to_summarize

['the average temperature in the month of January is 20.7°C. the lowest temperature in the year is 14.6°C. the maximum temperature in January is 27.4°C. the a is 4.0mm and the rain is a bright blue sky the average time of day is 8.9 hours. ',
 'the average temperature in the city of februray is 22.8°C. the lowest temperature in the world is 16.1°C. the sea is calmer than the sea at a temperature of 29.6°C. the outskirts of the town of Februray were hit by 7.0mm of rainfall. the average time of day is 9.8 hours. ',
 'the average temperature in the month of march is 25.4°C. the lowest temperature in march was 18.6°C. the temperature is forecast to be the highest in years and will reach 32.1°C in march. the a is a sliver of rainfall of 16.0mm. the average time of day is 10.4hours. ',
 'the average temperature in the spring is 26.6°C. the lowest temperature in a month is 20.8°C. the maximum temperature in the spring is 32.8°C. the rainfall is expected to be around 45.0mm in the month of Ap

In [15]:
toSummarize = summarize(to_summarize, 1)

Your max_length is set to 90, but your input_length is only 69. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=34)
Your max_length is set to 102, but your input_length is only 83. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=41)
Your max_length is set to 98, but your input_length is only 76. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=38)
Your max_length is set to 108, but your input_length is only 81. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=40)
Your m

In [16]:
toSummarize

['The average temperature in January is 20.7°C. The lowest temperature in the year is 14.6°C and the maximum temperature is 27.4°C the a is 4.0mm and the rain is a bright blue sky. The average time of day is 8.9 hours and the average temperature is 20°C in January.',
 'The lowest temperature in the world is 16.1°C. The average time of day is 9.8 hours in Februray. The sea is calmer than the sea at a temperature of 29.6°C and the outskirts of the town were hit by 7.0mm of rainfall. The city of Febrursuray has an average temperature of 22.8°C, and the lowest temperature of the world was 16.2°C in the same city.',
 'The temperature is forecast to be the highest in years and will reach 32.1°C in march. The average temperature in the month of march is 25.4°C. The lowest temperature in march was 18.6°C and the average time of day is 10.4hours. The a is a sliver of rainfall of 16.0mm.',
 'The average temperature in the spring is 26.6°C. the lowest temperature in a month is 20.8°C and the maxi



---



---




# c. Approch 3


>  Masking: Employed Masking pipelines to introduce meaningful words into the original sentence


In [17]:
# creatng a list of all the words present in the dataset 
import numpy as np

data_words = []

# Iterate over each row in the data
for row in data:
    # Iterate over each item in the row
    for item in row:

        # if the item is a string
        if isinstance(item, str):
            # Convert the item to lowercase, split it into words and add them in data_words
            data_words.extend((item.lower()).split())

# Remove duplicate words
data_words = np.unique(data_words)

In [18]:
# This function perform the masking operations

def masked_sentence(Sentences):
   
    # Iterate over each sentences
    for sentence in Sentences:
        # Iterate over each sentence in the sentence_list
        for idx, sent in enumerate(sentence):

            # Get the length of the sentence
            length = len(sent.split())

            # Iterate over each position in the sentence
            for i in range(length):

                # Split the sentence into words
                words = (sent.lower()).split(" ")

                # Insert a [MASK] token
                words.insert(2*i, "[MASK]")
                sent = ' '.join(words)

                # Calling the mask function
                options = unmasker(sent)

                # Iterate over each option in options
                for option in options:
                    # Check if the option is alphabetic and not already present in the words or dataframe
                    if option['token_str'].isalpha() and option['token_str'] not in words and option['token_str'] not in data_words:
                        # Replace the [MASK] token with the selected option
                        sent = sent.replace('[MASK]', option['token_str'])
                        break
                    else:
                        # Remove the [MASK] token if no suitable option is found
                        sent = sent.replace('[MASK]', '')
                        sent = " ".join(sent.split())

            # Update the modified sentence in the sentence list
            sentence[idx] = sent

    # Return the modified Sentences list
    return Sentences


In [19]:
import copy

# Create a copy of the 'sentences' list
Sentences = copy.deepcopy(sentences)

# Iterate twice to perform masking and replacement
for i in range(2):
    # Call the 'masked_sentence' function 
    Sentences = masked_sentence(Sentences)

Sentences

[['january 20.7°c annual average very high temperature',
  'january 14.6°c lowest temperature',
  'january 27.4°c maximum temperature',
  'rain on january 4.0mm of rainfall',
  'on january 8.9hour average sunlight per full day'],
 ['in februray 22.8°c average mean annual temperature',
  'in februray 16.1°c lowest annual mean temperature',
  'februray 29.6°c maximum temperature',
  'in februray 7.0mm rainfall',
  'in februray 9.8hour average number of direct sunlight hours per full day'],
 ['march 25.4°c average temperature',
  'march 18.6°c lowest temperature',
  'march 32.1°c maximum temperature',
  'in march 16.0mm rainfall',
  'recorded on march 10.4hour times average minimum sunlight hours per full day'],
 ['april 26.6°c average temperature',
  'april 20.8°c lowest temperature',
  'april 32.8°c maximum temperature',
  'in april 45.0mm rainfall',
  'april 10.6hour of average direct sunlight hours per full day'],
 ['may 25.5°c above average mean sea surface temperature',
  'may 21.0°

In [20]:
to_summarize = combine_sentence(Sentences)
to_summarize

['a january is an average year with temperatures of 20.7°c. january was the lowest temperature since records began at 14.6°c. january is the month when the temperature is 27.4°c at its maximum. a smoky 4.0mm of rain falls on january in the city. a full moon on january with an average of 8.9hours of sunlight per day ',
 'the average temperature in februray is 22.8°c a year. the average temperature in februray is 16.1°c. the maximum temperature in februray is 29.6°c. a smoky saturday morning saw a hefty rainfall of 7.0mm in februray. the average number of hours of direct sunlight per day in februray is 9.8. ',
 'the average temperature in march is 25.4°c. march is the lowest temperature since the first day of the year at 18.6°c. march is the highest temperature since records began at 32.1°c. a smoky march with 16.0mm of rainfall the average daily hours of sunlight recorded on march are 10.4 hours per hour. ',
 'april is the month when the average temperature is 26.6°c. april is the month

In [21]:
toSummarize = summarize(to_summarize, 1)

Your max_length is set to 114, but your input_length is only 88. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=44)
Your max_length is set to 98, but your input_length is only 91. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=45)
Your max_length is set to 104, but your input_length is only 77. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=38)
Your max_length is set to 118, but your input_length is only 85. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=42)
Your 

In [22]:
toSummarize

[' january is the month when the temperature is 27.4°c at its maximum. january was the lowest temperature since records began at 14.6°c. a smoky 4.0mm of rain falls on january in the city. a full moon on January with an average of 8.9hours of sunlight per day. a January is an average year with temperatures of 20.7°C.',
 "The average temperature in februray is 22.8°c a year. The maximum temperature is 29.6°c. The average number of hours of direct sunlight per day is 9.8. A smoky saturday morning saw a hefty rainfall of 7.0mm in the city. The city's average temperature is 16.1°C.",
 'The average temperature in march is 25.4°c. march is the lowest temperature since the first day of the year at 18.6°C. The average daily hours of sunlight recorded on march are 10.4 hours per hour. a smoky march with 16.0mm of rainfall is the highest temperature since records began at 32.1°c and the lowest since the beginning of 2013.',
 ' april is the best time to visit the coast as temperatures reach a max