# 1. Introduction

Bengali is the 5th most spoken language in the world, with hundreds of millions of speakers. It’s the official language of Bangladesh and the second most spoken language in India. Considering its reach, there’s significant business and educational interest in developing AI that can optically recognize images of handwritten Bengali. This challenge hopes to improve on approaches to Bengali recognition.

Optical character recognition is particularly challenging for Bengali. While Bengali has 49 letters in its alphabet (specifically, 11 vowels and 38 consonants), there are also 18 potential diacritics, or accents. This means that there are many more graphemes, the smallest units in a written language. The added complexity results in ~13,000 different grapheme variations (compared to English’s 250 graphemic units).

Bangladesh-based non-profit Bengali.AI is focused on helping to solve this problem. They build and release crowdsourced, metadata-rich datasets and open source them through research competitions. Through this work, Bengali.AI hopes to democratize and accelerate research in Bengali language technologies and to promote machine learning education.

For this competition, you’re given the image of a handwritten Bengali grapheme and are challenged to separately classify three constituent elements in the image: grapheme root, vowel diacritics, and consonant diacritics.

 


# 2. Data Description

This dataset contains images of individual handwritten Bengali characters. Bengali characters (graphemes) are written by combining three components: a grapheme_root, vowel_diacritic, and consonant_diacritic. Your challenge is to classify the components of the grapheme in each image. There are roughly 10,000 possible graphemes, of which roughly 1,000 are represented in the training set. The test set includes some graphemes that do not exist in train but has no new grapheme components. It takes a lot of volunteers filling out handwriting collection sheets to generate a useful amount of real data; focusing the problem on the grapheme components rather than on recognizing whole graphemes should make it possible to assemble a Bengali OCR system without handwriting samples for all 10,000 graphemes.

## 2.1 Files

**train.csv**

* `image_id`: the foreign key for the parquet files
* `grapheme_root`: the first of the three target classes
* `vowel_diacritic`: the second target class
* `consonant_diacritic`: the third target class
* `grapheme`: the complete character. Provided for informational purposes only; you should not need to use it.

**test.csv**

Every image in the test set will require three rows of predictions, one for each component. This csv specifies the exact order for you to provide your labels.

* `row_id`: foreign key to the sample submission
* `image_id`: foreign key to the parquet file
* `component`: the required target class for the row (grapheme_root, vowel_diacritic, or consonant_diacritic)
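Since test.csv fixes the row order, one way to turn per-image predictions into submission rows is to reshape them to one row per component and join on `image_id` and `component`. This is only a minimal sketch: `build_submission`, `preds_df`, and the toy frames are hypothetical names and made-up values, and the `row_id` strings in the toy data are placeholders rather than values read from the real file.

```python
import pandas as pd

COMPONENTS = ["grapheme_root", "vowel_diacritic", "consonant_diacritic"]

def build_submission(test_df, preds_df):
    """Return row_id/target pairs in the same order as test_df.

    preds_df (hypothetical) has one row per image with one column per component.
    """
    # Wide -> long: one row per (image_id, component) pair.
    long_preds = preds_df.melt(
        id_vars=["image_id"],
        value_vars=COMPONENTS,
        var_name="component",
        value_name="target",
    )
    # A left join keeps the row order of test_df.
    merged = test_df.merge(long_preds, on=["image_id", "component"], how="left")
    return merged[["row_id", "target"]]

# Toy data (made-up values, only to show the reshaping).
toy_test = pd.DataFrame({
    "row_id": ["img0_consonant_diacritic", "img0_grapheme_root", "img0_vowel_diacritic"],
    "image_id": ["img0", "img0", "img0"],
    "component": ["consonant_diacritic", "grapheme_root", "vowel_diacritic"],
})
toy_preds = pd.DataFrame({
    "image_id": ["img0"],
    "grapheme_root": [3],
    "vowel_diacritic": [1],
    "consonant_diacritic": [0],
})
toy_submission = build_submission(toy_test, toy_preds)
print(toy_submission)
```

Joining on the two keys avoids having to guess how `row_id` is constructed; the order simply follows test.csv.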

**sample_submission.csv**

* `row_id`: foreign key to test.csv
* `target`: the target column

**(train/test).parquet**

Each parquet file contains tens of thousands of 137x236 grayscale images. The images have been provided in the parquet format for I/O and space efficiency. Each row in the parquet files contains an `image_id` column, and the flattened image.

**class_map.csv**

Maps the class labels to the actual Bengali grapheme components.

# 3. Peek at the Input Folder

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

# 4. Fetching Data

In [None]:
train = pd.read_csv('/kaggle/input/bengaliai-cv19/train.csv')
test = pd.read_csv('/kaggle/input/bengaliai-cv19/test.csv')
sample = pd.read_csv('/kaggle/input/bengaliai-cv19/sample_submission.csv')
class_map = pd.read_csv('/kaggle/input/bengaliai-cv19/class_map.csv')


In [None]:
print('Size of train data', train.shape)
print('Size of test data', test.shape)
print('Size of sample submission', sample.shape)
print('Size of Class Map: ', class_map.shape)

## 4.2  Peek at the data

### Train Dataframe

In [None]:
train.head()

In [None]:
train.columns

In [None]:
train.describe()

### Test Dataframe

In [None]:
test.head()

In [None]:
test.columns

In [None]:
test.describe()

### Sample Submission

In [None]:
sample.head()

### Class Map

In [None]:
class_map.head()
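To make the class map a bit more tangible, here is a small sketch of looking up the component behind a numeric label. I am assuming the columns are named `component_type`, `label`, and `component` (check `class_map.columns` before relying on this); `decode_component` is a hypothetical helper, and the toy rows use placeholder strings instead of real Bengali components.

```python
import pandas as pd

def decode_component(class_map_df, component_type, label):
    """Look up the component string for a (component_type, label) pair.

    Assumes columns 'component_type', 'label', and 'component'.
    """
    mask = (class_map_df["component_type"] == component_type) & (
        class_map_df["label"] == label
    )
    match = class_map_df.loc[mask, "component"]
    if match.empty:
        raise KeyError(f"No entry for {component_type} label {label}")
    return match.iloc[0]

# Toy class map with placeholder component strings (not real data).
toy_class_map = pd.DataFrame({
    "component_type": ["grapheme_root", "grapheme_root", "vowel_diacritic"],
    "label": [0, 1, 0],
    "component": ["root_a", "root_b", "vowel_a"],
})
decoded = decode_component(toy_class_map, "grapheme_root", 1)
print(decoded)
```

The same call should work on the real `class_map` dataframe if the column names match.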

### Image Data

Image data is stored in parquet files and contains grayscale images with the dimensions listed below. If you want to read more about this file format, try: https://acadgild.com/blog/parquet-file-format-hadoop Note that each row of the file holds the `image_id` followed by the values of all 32332 pixels (137*236) of the corresponding image.

`Image Height = 137`

`Image Width = 236`


### Image Utils

In [None]:
HEIGHT = 137
WIDTH = 236

def load_images(file):
    # Column 0 is image_id; the remaining columns are the flattened pixel values
    df = pd.read_parquet(file)
    return df.iloc[:, 1:].values.reshape(-1, HEIGHT, WIDTH)
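`load_images` drops the `image_id` column, so the images can no longer be matched to the labels in train.csv. A possible variant that keeps the ids is sketched below. `split_ids_and_images` is a hypothetical helper, and the demo frame is a tiny synthetic stand-in (2x3 "images") so the snippet runs without the real parquet files.

```python
import numpy as np
import pandas as pd

def split_ids_and_images(df, height=137, width=236):
    """Split a parquet-style frame into image ids and a (N, height, width) array."""
    image_ids = df["image_id"].to_numpy()
    pixels = df.drop(columns="image_id").to_numpy(dtype=np.uint8)
    return image_ids, pixels.reshape(-1, height, width)

# Synthetic stand-in: two "images" of 2x3 pixels, flattened row by row.
demo_df = pd.DataFrame(
    np.arange(12, dtype=np.uint8).reshape(2, 6),
    columns=[str(i) for i in range(6)],
)
demo_df.insert(0, "image_id", ["demo_0", "demo_1"])

demo_ids, demo_imgs = split_ids_and_images(demo_df, height=2, width=3)
print(demo_ids, demo_imgs.shape)
```

With the real data you would pass the frame returned by `pd.read_parquet` and keep the default height and width.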

In [None]:
## loading one of the parquet files for analysis
dummy_images = load_images('/kaggle/input/bengaliai-cv19/train_image_data_0.parquet')
print("Shape of loaded files: ", dummy_images.shape)
print("Number of images in loaded files: ", dummy_images.shape[0])
print("Shape of first loaded image: ", dummy_images[0].shape)
print("\n\nFirst image looks like:\n\n", dummy_images[0])

### Plotting image

In [None]:
import matplotlib.pyplot as plt

## View the pixel values as image
plt.imshow(dummy_images[10], cmap='Greys')

#### Plotting more images for better intuition

In [None]:
f, ax = plt.subplots(6, 6, figsize=(16, 10))

for i in range(6):
    for j in range(6):
        ax[i][j].imshow(dummy_images[i*6+j], cmap='Greys')


## 4.3. Checking for Null Values

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

In [None]:
class_map.isnull().sum()

#### Luckily, there are no null values in this dataset. Bye-bye, imputation!

## 4.4 Checking for class distribution

In [None]:
import seaborn as sns


sns.catplot(x='vowel_diacritic',data=train,kind="count", height=8.27, aspect=11.7/8.27)

In [None]:
sns.catplot(x='consonant_diacritic',data=train,kind="count", height=8.27, aspect=11.7/8.27)


In [None]:
sns.catplot(x='grapheme_root',data=train,kind="count", height=8.27, aspect=30/8.27)
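The count plots show the imbalance visually; to put a rough number on it, something like the sketch below could help. `imbalance_summary` is a hypothetical helper and the toy labels are made up. On the real data you would call it as `imbalance_summary(train['grapheme_root'])`.

```python
import pandas as pd

def imbalance_summary(labels):
    """Summarise how skewed a label column is."""
    counts = labels.value_counts()  # sorted from most to least frequent
    return {
        "most_common": counts.index[0],
        "least_common": counts.index[-1],
        "ratio": counts.iloc[0] / counts.iloc[-1],
    }

# Toy labels (made up): class 0 appears four times, class 2 only once.
toy_labels = pd.Series([0, 0, 0, 0, 1, 1, 2])
toy_summary = imbalance_summary(toy_labels)
print(toy_summary)
```

A large ratio suggests that plain accuracy may hide poor performance on rare classes.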


In [None]:
print("Unique Grapheme-Root in train data: ", train.grapheme_root.nunique())
print("Unique Vowel-Diacritic in train data: ", train.vowel_diacritic.nunique())
print("Unique Consonant-Diacritic in train data: ", train.consonant_diacritic.nunique())
print("Unique Grapheme (Combination of three) in train data: ", train.grapheme.nunique())
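The data description says only a fraction of the possible graphemes appears in train. A quick way to check this from the component columns is sketched below: multiply the unique counts to get the number of possible combinations, then compare with the number of distinct triples actually present. `grapheme_coverage` is a hypothetical helper and the toy frame is made up.

```python
import pandas as pd

COMPONENTS = ["grapheme_root", "vowel_diacritic", "consonant_diacritic"]

def grapheme_coverage(df):
    """Return (possible, observed) counts of component combinations."""
    possible = 1
    for col in COMPONENTS:
        possible *= df[col].nunique()
    observed = len(df[COMPONENTS].drop_duplicates())
    return possible, observed

# Toy frame (made up): 3 roots x 2 vowels x 2 consonants = 12 possible triples.
toy_train = pd.DataFrame({
    "grapheme_root": [0, 0, 1, 1, 2],
    "vowel_diacritic": [0, 1, 0, 0, 1],
    "consonant_diacritic": [0, 0, 0, 0, 1],
})
toy_possible, toy_observed = grapheme_coverage(toy_train)
print(toy_possible, toy_observed)
```

Running `grapheme_coverage(train)` on the real dataframe should show how sparse the observed combinations are, which is why the task is framed per component.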

Since I don't want this kernel notebook to get heavy, I am experimenting with model-related tasks in another notebook.

[Please visit this kernel for modelling](https://www.kaggle.com/rohitsingh9990/bengaliai-starter-eda-multi-output-densenet/edit)

### If you find this kernel useful, do upvote.
