Credits: https://www.kaggle.com/hoshi7/chaii-the-beginning-eda-wordclouds by @hoshi7
This work is an extension of the work above. 
Please upvote the original notebook if you find this useful.
Thanks

# Contents: 
1. [Importing Libraries and Data](#1) 
    1. [Necessary Functions](#necessary_functions)
2. [Understanding Data](#2)
3. [Exploratory Data Analysis](#3)
    1. [What is the number of Tamil and Hindi questions in the dataset?](#4) 
    2. [What are the most prevalent words for Hindi?](#5)
    3. [What are the most prevalent question words for Hindi?](#6)
    4. [What are the most prevalent answers for Hindi?](#7)

## [Importing Libraries and Data]()<a id="1"></a> <br>


In [None]:
#Basic libraries
import numpy as np
import pandas as pd

#EDA: 
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as py
import plotly.tools as tls
from plotly.offline import init_notebook_mode
## Wordclouds
import altair as alt
from altair.vega import v5
from IPython.display import HTML
import json

#Basic Preprocessing
from collections import defaultdict, Counter

#Question-Answering
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

In [None]:
df = pd.read_csv('../input/chaii-hindi-and-tamil-question-answering/train.csv')

### [Necessary Functions]()<a id="necessary_functions"></a> <br>

In [None]:
# Defining functions for visualizations: 

def pie_plot(labels, values, colors, title):
    fig = {
      "data": [
        {
          "values": values,
          "labels": labels,
          "domain": {"x": [0, .48]},
          "name": "Language",
          "sort": False,
          "marker": {'colors': colors},
          "textinfo":"percent+label",
          "textfont": {'color': '#FFFFFF', 'size': 10},
          "hole": .6,
          "type": "pie"
        } ],
        "layout": {
            "title":title,
            "annotations": [
                {
                    "font": {
                        "size": 25,

                    },
                    "showarrow": False,
                    "text": ""

                }
            ]
        }
    }
    return fig

In [None]:
##-----------------------------------------------------------
# This whole section sets up the RequireJS/Vega workaround used to render charts in the notebook
vega_url = 'https://cdn.jsdelivr.net/npm/vega@' + v5.SCHEMA_VERSION
vega_lib_url = 'https://cdn.jsdelivr.net/npm/vega-lib'
vega_lite_url = 'https://cdn.jsdelivr.net/npm/vega-lite@' + alt.SCHEMA_VERSION
vega_embed_url = 'https://cdn.jsdelivr.net/npm/vega-embed@3'
noext = "?noext"

paths = {
    'vega': vega_url + noext,
    'vega-lib': vega_lib_url + noext,
    'vega-lite': vega_lite_url + noext,
    'vega-embed': vega_embed_url + noext
}

workaround = """
requirejs.config({{
    baseUrl: 'https://cdn.jsdelivr.net/npm/',
    paths: {}
}});
"""

#------------------------------------------------ Defs for future rendering
def add_autoincrement(render_func):
    # Keep track of unique <div/> IDs
    cache = {}
    def wrapped(chart, id="vega-chart", autoincrement=True):
        if autoincrement:
            if id in cache:
                counter = 1 + cache[id]
                cache[id] = counter
            else:
                cache[id] = 0
            actual_id = id if cache[id] == 0 else id + '-' + str(cache[id])
        else:
            if id not in cache:
                cache[id] = 0
            actual_id = id
        return render_func(chart, id=actual_id)
    # The cache lives in this closure, so it persists across calls to wrapped()
    return wrapped
            
@add_autoincrement
def render(chart, id="vega-chart"):
    chart_str = """
    <div id="{id}"></div><script>
    require(["vega-embed"], function(vg_embed) {{
        const spec = {chart};     
        vg_embed("#{id}", spec, {{defaultStyle: true}}).catch(console.warn);
        console.log("works?");
    }});
    console.log("recheck to see if it works?");
    </script>
    """
    return HTML(
        chart_str.format(
            id=id,
            chart=json.dumps(chart) if isinstance(chart, dict) else chart.to_json(indent=None)
        )
    )



HTML("".join((
    "<script>",
    workaround.format(json.dumps(paths)),
    "</script>")))

In [None]:
# Wordcloud function


def word_cloud(df, pixwidth=6000, pixheight=350, column="index", counts="count"):
    data= [dict(name="dataset", values=df.to_dict(orient="records"))]
    wordcloud = {
        "$schema": "https://vega.github.io/schema/vega/v5.json",
        "width": pixwidth,
        "height": pixheight,
        "padding": 0,
        "title": "Hover to see the number of occurrences across all the sequences",
        "data": data
    }
    scale = dict(
        name="color",
        type="ordinal",
        range=["cadetblue", "royalblue", "steelblue", "navy", "teal"]
    )
    mark = {
        "type":"text",
        "from":dict(data="dataset"),
        "encode":dict(
            enter=dict(
                text=dict(field=column),
                align=dict(value="center"),
                baseline=dict(value="alphabetic"),
                fill=dict(scale="color", field=column),
                tooltip=dict(signal="datum.{} + ' occurrences'".format(counts))
            )
        ),
        "transform": [{
            "type": "wordcloud",
            "text": dict(field=column),
            "size": [pixwidth, pixheight],
            "font": "Helvetica Neue, Arial",
            "fontSize": dict(field="datum.{}".format(counts)),
            "fontSizeRange": [10, 60],
            "padding": 2
        }]
    }
    wordcloud["scales"] = [scale]
    wordcloud["marks"] = [mark]
    
    return wordcloud



def wordcloud_create(df, field):
    # Count whitespace-separated tokens across every row of the chosen column
    counts = Counter()
    for text in df[field].values.tolist():
        counts.update(text.split())
    # Keep only the 300 most frequent tokens
    top = dict(counts.most_common(300))
    corpus = pd.Series(top)  # Series indexed by token; values are the counts
    return render(word_cloud(corpus.to_frame(name="count").reset_index(), pixheight=600, pixwidth=900))
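
The counting step inside `wordcloud_create` can be checked on its own, without pandas or Vega. The snippet below is a minimal sketch of that step; the helper name `top_word_records` and the toy corpus are hypothetical and only illustrate the shape of the records that end up in the Vega spec (an `index` field for the token and a `count` field for its frequency).

```python
from collections import Counter


def top_word_records(texts, n=300):
    """Return the n most common whitespace-separated tokens as a list of records.

    Hypothetical helper: it mirrors the counting done in wordcloud_create,
    but returns plain dicts instead of rendering a chart.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return [{"index": word, "count": count} for word, count in counts.most_common(n)]


# Toy corpus (made up for illustration)
toy_corpus = ["a b a", "b a c"]
records = top_word_records(toy_corpus, n=2)
print(records)
```

Because tokens are split on whitespace only, punctuation attached to a word is counted as part of that word.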

## [Understanding Data]()<a id="2"></a> <br>


In [None]:
df.head()

The dataset has the following columns: 
- **id**: The unique ID of the row. 
- **context**: The Hindi or Tamil text from which the answer must be extracted. 
- **question**: The question, in the same language as the context.
- **answer_text**: The text of the answer. This is what we predict for the test set. Predictions are evaluated with the Jaccard score, which measures the word overlap between the predicted answer and the true answer. 
- **answer_start**: The character index in the context at which the answer begins. 
- **language**: The language of the context and question. 
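
To make the last few columns concrete, here is a minimal sketch of a word-level Jaccard score and of a check that `answer_start` really points at `answer_text` inside `context`. The function names and the handling of two empty strings are assumptions made for illustration; the exact competition scoring code may differ.

```python
def jaccard(pred: str, truth: str) -> float:
    """Word-level Jaccard similarity between two strings (illustrative sketch)."""
    a = set(pred.lower().split())
    b = set(truth.lower().split())
    if not a and not b:
        # Assumption: two empty strings are treated as a perfect match
        return 1.0
    common = a & b
    return len(common) / (len(a) + len(b) - len(common))


def span_matches(context: str, answer_text: str, answer_start: int) -> bool:
    """Check that answer_text occurs in context at character index answer_start."""
    return context[answer_start:answer_start + len(answer_text)] == answer_text


# Toy examples (made up, not taken from the dataset)
print(jaccard("new delhi", "delhi"))
print(span_matches("capital is delhi", "delhi", 11))
```

Running `span_matches` over every training row is a quick way to confirm that the character offsets are consistent before building a model on top of them.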

## [Exploratory Data Analysis]()<a id="3"></a> <br>

#### [What is the number of Tamil and Hindi questions in the dataset?]()<a id="4"></a> <br>

In [None]:
value_counts = df['language'].value_counts()
labels = value_counts.index.tolist()
py.iplot(pie_plot(labels, value_counts, ['#1B9E77', '#D95F02'], "Language breakdown"))

#### [What are the most prevalent words for Hindi?]()<a id="5"></a> <br>

In [None]:
hindi_df = df[df['language']=='hindi']
wordcloud_create(hindi_df, 'context')

Since I know Hindi, I can read the wordcloud above and see that: 
- As in English, the most common words are stop words. 
- Punctuation is present in the Hindi data too, such as: , | ? - 


Based on these, we can formulate a plan for the next steps. 
Let's draw the same for Tamil and see if something can be inferred. 
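
One possible next step, given the stop words and punctuation seen in the wordcloud, is to strip punctuation and drop stop words before counting. The sketch below assumes a hand-picked punctuation set (ASCII punctuation plus the Devanagari danda) and a tiny, hypothetical stop-word list. Neither comes from the dataset or from a library, so treat both as placeholders.

```python
import string

# Assumption: ASCII punctuation plus the Devanagari danda; extend as needed
PUNCTUATION = string.punctuation + "।"
# Hypothetical, deliberately tiny stop-word list for illustration only
HINDI_STOP_WORDS = {"के", "का", "की", "में", "है", "और", "से"}

_STRIP_TABLE = str.maketrans("", "", PUNCTUATION)


def clean_tokens(text, stop_words=HINDI_STOP_WORDS):
    """Remove punctuation characters, split on whitespace, and drop stop words."""
    tokens = text.translate(_STRIP_TABLE).split()
    return [token for token in tokens if token not in stop_words]


# Toy sentence (made up for illustration)
print(clean_tokens("the cat, the dog | end?", stop_words={"the"}))
```

Note that removing `-` this way joins hyphenated words into a single token, which may or may not be desirable.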


#### [What are the most prevalent question words for Hindi?]()<a id="6"></a> <br>

In [None]:
hindi_df = df[df['language']=='hindi']
wordcloud_create(hindi_df, 'question')

**Alongside punctuation, question words are present as well.**

#### [What are the most prevalent answers for Hindi?]()<a id="7"></a> <br>

In [None]:
hindi_df = df[df['language']=='hindi']
wordcloud_create(hindi_df, 'answer_text')

**Now it's apparent that proper nouns are more likely to appear in answers than any other part of speech.
Quantities and dates (months, numbers, etc.) also appear in answers.**
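
The observation about numbers can be checked roughly by measuring how many answers contain at least one digit. This is a minimal sketch under stated assumptions: the helper name `share_with_digits` is hypothetical, and the sample answers are made up rather than taken from the dataset. Python's `str.isdigit` also recognises Devanagari and Tamil digits, so no extra handling is needed for them.

```python
def share_with_digits(answers):
    """Return the fraction of answers that contain at least one digit character."""
    answers = list(answers)
    if not answers:
        return 0.0
    hits = sum(any(ch.isdigit() for ch in answer) for answer in answers)
    return hits / len(answers)


# Made-up sample answers for illustration
sample_answers = ["1947", "दिल्ली", "१२ मई", "गांधी"]
print(share_with_digits(sample_answers))
```

On the real data this could be called as `share_with_digits(hindi_df['answer_text'])`, assuming `hindi_df` is defined as in the cells above.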