NNFL Project: Paper ID: 113
Contributors:
- Ateeksha Mittal 2017A8PS0431P
- Dishita Malav 2017A7PS0164P
- Shefali Tripathi 2017A7PS0139P
All notebooks uploaded were written using Google Colab; we suggest running them there as well.
- Implement Character-level Convolutional Network for Text Classification on the AG News dataset using PyTorch.
- Compare the results obtained with that of Word-Based Convolutional Network for Text Classification on the same dataset.
- Papers
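The character-level model in the paper represents each news sample as a fixed-size matrix of one-hot character vectors before any convolution is applied. A minimal sketch of that quantization step is below; the alphabet and the 1014-character length follow the paper, while the function and variable names are our own illustration, not code from the notebooks.

```python
# Sketch of the character quantization from "Character-level Convolutional
# Networks for Text Classification" (Zhang et al.): each character is
# one-hot encoded over a fixed alphabet, and every sample is truncated or
# padded to a fixed length (1014 in the paper).

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}"
MAX_LEN = 1014

def quantize(text, alphabet=ALPHABET, max_len=MAX_LEN):
    """Return a max_len x len(alphabet) one-hot matrix for `text`.

    Characters outside the alphabet, and padding positions past the end
    of the text, become all-zero rows, as in the paper."""
    char_to_idx = {c: i for i, c in enumerate(alphabet)}
    matrix = [[0.0] * len(alphabet) for _ in range(max_len)]
    for pos, char in enumerate(text.lower()[:max_len]):
        idx = char_to_idx.get(char)
        if idx is not None:
            matrix[pos][idx] = 1.0
    return matrix

encoded = quantize("AG News sample")
print(len(encoded), len(encoded[0]))  # 1014 rows, one per character slot
```

This matrix (transposed to channels-first) is what the first convolutional layer of the character-level network consumes.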
The datasets used for training and testing could not be uploaded because they exceed 25 MB. We have used two datasets:
Prepared by us by cleaning the raw AG News corpus of around 2 million news samples, obtained from http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html.
Dataset Link: https://drive.google.com/drive/folders/1nUPhyFj164LnRKECFOcw8cpoIDShlqdP?usp=sharing
Prepared by one of the authors of the paper "Character-level Convolutional Networks for Text Classification", Xiang Zhang. We obtained this dataset from his personal website: http://xzh.me/
Dataset Link: https://drive.google.com/drive/folders/1vZ1agGTdHJDX455Vnl7Y1TW9eXqhUhWx?usp=sharing
Pre-trained word embeddings.
Dataset Link: https://drive.google.com/file/d/1jsieNbVR1h1o_bSuYMZvFMeUjFy1xiMa/view?usp=sharing
Note: Please open using your BITS email ID.
This notebook contains the code used to clean the AG News corpus. Running it saves the cleaned CSV files into the Drive folder linked above. (For convenience, we have already placed the cleaned files in that folder.)
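We do not reproduce the full cleaning pipeline here, but a minimal sketch of the kind of per-field cleanup involved is shown below. The specific steps (HTML unescaping, whitespace normalization) and the three-column CSV layout are assumptions for illustration; the actual logic lives in Cleaning.ipynb.

```python
import csv
import html
import re

def clean_text(text):
    """Illustrative cleanup: unescape HTML entities and collapse runs of
    whitespace. The real Cleaning.ipynb steps may differ."""
    text = html.unescape(text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def clean_corpus(in_path, out_path):
    # Read raw (label, title, description) rows and write cleaned rows.
    # The column layout is an assumption, not taken from the notebook.
    with open(in_path, newline="", encoding="utf-8") as fin, \
         open(out_path, "w", newline="", encoding="utf-8") as fout:
        writer = csv.writer(fout)
        for label, title, desc in csv.reader(fin):
            writer.writerow([label, clean_text(title), clean_text(desc)])

print(clean_text("Wall St. &amp; Main\n  St."))  # Wall St. & Main St.
```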
This notebook contains the implementation of Character-Based Convolutional Networks used to classify news samples from the AG News Corpus.
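For orientation, here is a condensed sketch of the character-level architecture from the paper: six 1-D convolutions with three max-pooling stages, followed by three fully connected layers. Setting `num_features` to 256 gives the paper's "small" feature configuration and 1024 the "large" one; the class and layer names are ours, not copied from CharCNN.ipynb.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Condensed sketch of the Zhang et al. character-level CNN."""

    def __init__(self, alphabet_size=70, num_features=256, num_classes=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(alphabet_size, num_features, kernel_size=7), nn.ReLU(),
            nn.MaxPool1d(3),
            nn.Conv1d(num_features, num_features, kernel_size=7), nn.ReLU(),
            nn.MaxPool1d(3),
            nn.Conv1d(num_features, num_features, kernel_size=3), nn.ReLU(),
            nn.Conv1d(num_features, num_features, kernel_size=3), nn.ReLU(),
            nn.Conv1d(num_features, num_features, kernel_size=3), nn.ReLU(),
            nn.Conv1d(num_features, num_features, kernel_size=3), nn.ReLU(),
            nn.MaxPool1d(3),
        )
        # On a length-1014 input the conv stack leaves a temporal
        # dimension of 34, so the flattened size is num_features * 34.
        self.fc = nn.Sequential(
            nn.Linear(num_features * 34, 1024), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1024, num_classes),
        )

    def forward(self, x):
        # x: (batch, alphabet_size, 1014) one-hot character tensor
        x = self.conv(x)
        return self.fc(x.flatten(1))

model = CharCNN()
logits = model(torch.zeros(2, 70, 1014))
print(logits.shape)  # torch.Size([2, 4])
```

The four output logits correspond to the four AG News classes (World, Sports, Business, Sci/Tech).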
This notebook contains the implementation of Word-Based Convolutional Networks used to classify news samples from the AG News Corpus.
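As a point of comparison, a word-level text CNN typically convolves over word embeddings with several parallel kernel widths and max-pools each over time (Kim-2014 style). The sketch below illustrates that pattern; the hyperparameters are assumptions for illustration and WordCNN.ipynb may differ, though its embedding layer can be initialised from the pre-trained embeddings linked above.

```python
import torch
import torch.nn as nn

class WordCNN(nn.Module):
    """Sketch of a word-level CNN classifier with parallel convolutions
    over word embeddings (illustrative hyperparameters)."""

    def __init__(self, vocab_size=50000, embed_dim=300, num_filters=100,
                 kernel_sizes=(3, 4, 5), num_classes=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed, seq)
        # Max-pool each convolution's output over time, then concatenate.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))

model = WordCNN()
logits = model(torch.zeros(2, 60, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 4])
```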
- Our Dataset
- Author's Dataset
- Our Dataset
- Author's Dataset
- When run on the Author's Dataset, the models predict the classes of random news samples correctly, with both the large and the small feature configurations.
- On our Dataset, the models do not perform as well, which is consistent with the test accuracies achieved on the two datasets.
- Download the notebooks Cleaning.ipynb, CharCNN.ipynb, and WordCNN.ipynb, then upload them to your Google Drive.
- Open the notebooks with Google Colab.
- Download the pre-trained word embeddings and upload them directly to your Google Drive.
- Create shortcuts of the Author's Dataset and Our Dataset in your Drive.
- You may run Cleaning.ipynb. (This is not required, since we have already done it.)
- You can run CharCNN.ipynb and WordCNN.ipynb on either our dataset or the author's dataset (while mounting the drive, uncomment the path of the dataset you wish to use and comment out the other), using either the small or the large feature configuration (while instantiating the model, uncomment the function call you wish to use and comment out the other).
- Run the notebooks by clicking on Runtime >> Run All.
Please open all the dataset links provided using your BITS email ID.