# Predict which Tweets are about real disasters and which ones are not

In [None]:
import os

**The `os` module in Python provides functions for interacting with the operating system. It is part of Python’s standard library and offers a portable way of using operating-system-dependent functionality. The *os* and *os.path* modules include many functions for working with the file system.**
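As a small illustration of the path helpers mentioned above, the sketch below uses only `os.path` functions from the standard library. The directory and file names are made up for the example and do not refer to the competition data.

```python
import os

# Hypothetical path pieces, chosen only for illustration
path = os.path.join("input", "tweets", "train.csv")

print(os.path.basename(path))     # file name component
print(os.path.dirname(path))      # directory component
print(os.path.splitext(path)[1])  # file extension
```

`os.path.join` inserts the separator that is correct for the current platform, which is why it is preferred over concatenating strings by hand.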

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load



# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory


for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
pip install nlpaug

**This Python library helps you augment text (NLP) data for your machine learning projects.**
*content source: https://neptune.ai/blog/data-augmentation-in-python*

**What is Data Augmentation?**
Data Augmentation is a technique that can be used to artificially expand the size of a training set by creating modified data from the existing one. It is a good practice to use DA if you want to prevent overfitting, or the initial dataset is too small to train on, or even if you want to squeeze better performance from your model.

To be clear: Data Augmentation is not only used to prevent overfitting. In general, having a large dataset is crucial for the performance of both ML and Deep Learning (DL) models. However, we can improve the performance of the model by augmenting the data we already have, so Data Augmentation is also a way to enhance the model’s performance.

In general, DA is frequently used when building a DL model, which is why the source article mostly covers Data Augmentation with various DL frameworks. Still, keep in mind that you can augment the data for classical ML problems as well.

You can augment:

1. Audio
2. Text
3. Images
4. Any other types of data

The source article focuses on image augmentations, as those are the most popular ones, while this notebook applies augmentation to text. Augmenting other types of data can be just as efficient and easy, so it is worth remembering some common techniques for each data type.

**Data Augmentation techniques**

We can apply various changes to the initial data. For example, for **images** we can use:

1. Geometric transformations – you can randomly flip, crop, rotate or translate images, and that is just the tip of the iceberg
2. Color space transformations – change RGB color channels, intensify any color
3. Kernel filters – sharpen or blur an image 
4. Random Erasing – delete a part of the initial image
5. Mixing images – basically, mix images with one another. Might be counterintuitive but it works


**For text there are:**

1. Word/sentence shuffling
2. Word replacement – replace words with synonyms
3. Syntax-tree manipulation – paraphrase the sentence to be grammatically correct using the same words
4. Other techniques, described in the article about Data Augmentation in NLP (https://neptune.ai/blog/data-augmentation-nlp)
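Two of the text techniques listed above can be sketched in a few lines of plain Python. This is only a toy illustration: the `SYNONYMS` table and the helper names `replace_synonyms` and `shuffle_words` are hypothetical, and a real project would rely on a thesaurus or an embedding model rather than a hand-written dictionary.

```python
import random

# Hypothetical, hand-written synonym table used only for this example
SYNONYMS = {"fire": ["blaze"], "big": ["large", "huge"], "street": ["road"]}

def replace_synonyms(sentence, p=0.5, rng=random):
    """Replace each word that has a known synonym with probability p."""
    out = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)

def shuffle_words(sentence, rng=random):
    """Return the sentence with its words in random order."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)

rng = random.Random(0)  # fixed seed so the run is repeatable
original = "big fire on the main street"
print(replace_synonyms(original, p=1.0, rng=rng))
print(shuffle_words(original, rng=rng))
```

Word shuffling destroys word order, so it tends to suit bag-of-words models better than sequence models; synonym replacement keeps the sentence structure intact.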

For audio augmentation you can use:

1. Noise injection
2. Shifting
3. Changing the speed of the tape
4. And many more

Moreover, the greatest advantage of the augmentation techniques is that you may use all of them at once. Thus, you may get plenty of unique samples of data from the initial one.
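The idea of applying several augmentations at once can be sketched as a simple chain of functions. The helpers below (`random_swap`, `random_delete`, `augment`) are hypothetical names written for this illustration with the standard library only; they are not part of nlpaug.

```python
import random

def random_swap(words, rng):
    """Swap two randomly chosen positions (no-op for fewer than two words)."""
    words = list(words)
    if len(words) >= 2:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_delete(words, rng, p=0.2):
    """Drop each word with probability p, but keep at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept or [rng.choice(words)]

def augment(sentence, steps, n=3, seed=0):
    """Apply every step in order, n times, and return the unique results."""
    rng = random.Random(seed)
    results = set()
    for _ in range(n):
        words = sentence.split()
        for step in steps:
            words = step(words, rng)
        results.add(" ".join(words))
    return sorted(results)

for variant in augment("forest fire near the village", [random_swap, random_delete]):
    print(variant)
```

Because each pass draws fresh random choices, one input sentence can yield several distinct variants.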

In [None]:
pip install scikit-learn

Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
Source: Wikipedia

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns

import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as naf

from tqdm import tqdm

from nlpaug.util import Action
from sklearn.model_selection import train_test_split

## Loading and showing initial data

In [None]:
tweet = pd.read_csv('../input/nlpgettingstarted/train.csv')
test = pd.read_csv('../input/nlpgettingstarted/test.csv')

In [None]:
tweet

In [None]:
test

In [None]:
tweet.shape

In [None]:
tweet=tweet.drop(['keyword','location'],axis=1)
print(tweet)

## Class distribution

Let's check class distribution.

In [None]:
x = tweet.target.value_counts()
# recent seaborn releases require keyword arguments for x and y
sns.barplot(x=x.index, y=x.values)
plt.gca().set_ylabel('samples')

From the bar plot we can see that the classes are imbalanced: tweets that are not about a disaster (0) outnumber tweets about a disaster (1).
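The imbalance can also be expressed as numbers rather than read off a plot. The sketch below uses a made-up toy column in place of the real `target` column, so the printed values say nothing about the competition data.

```python
import pandas as pd

# Toy labels standing in for the `target` column (values are made up)
toy = pd.DataFrame({"target": [0, 0, 0, 1, 1]})

counts = toy["target"].value_counts()                # absolute counts per class
shares = toy["target"].value_counts(normalize=True)  # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()        # majority size / minority size

print(counts)
print(shares)
print(imbalance_ratio)
```

Running the same three lines on `tweet["target"]` would give the corresponding figures for the training set.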

## Data Augmentation

In [None]:
# model_type: word2vec, glove or fasttext
aug_w2v = naw.WordEmbsAug(
#     model_type='word2vec', model_path='../input/nlpword2vecembeddingspretrained/GoogleNews-vectors-negative300.bin',
    model_type='glove', model_path='../input/glove-global-vectors-for-word-representation/glove.6B.100d.txt',
    action="substitute")
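To give some intuition for substitution by word embeddings, the toy sketch below picks the most similar word by cosine similarity. It is not nlpaug's implementation: the 2-D vectors in `TOY_VECTORS` are invented for the example, and `nearest_word` is a hypothetical helper.

```python
import math

# Hypothetical 2-D "embeddings"; real GloVe vectors have far more dimensions
TOY_VECTORS = {
    "fire":  [0.9, 0.1],
    "blaze": [0.85, 0.2],
    "flood": [0.3, 0.9],
    "water": [0.2, 0.95],
}

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest_word(word, vectors=TOY_VECTORS):
    """Return the most similar other word, or the word itself if it is unknown."""
    if word not in vectors:
        return word
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))

print(nearest_word("fire"))
print(nearest_word("flood"))
```

Words that have no vector are left unchanged in this sketch, which mirrors the general idea that only in-vocabulary words can be substituted.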

In [None]:
text = tweet.iloc[0]['text']
text

In [None]:
aug_w2v.aug_p=0.2
print("Augmented Text:")
for ii in range(5):
    augmented_text = aug_w2v.augment(text)
    print(augmented_text)

In [None]:
train,valid=train_test_split(tweet,test_size=0.15)
print('Shape of train',train.shape)
print("Shape of Validation ",valid.shape)

In [None]:
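from sklearn.utils import shuffle

def augment_text(df, samples=300, pr=0.2):
    """Add `samples` augmented copies of disaster tweets (target == 1) to df."""
    aug_w2v.aug_p = pr
    new_text = []

    # keep only the minority class (disaster tweets) as augmentation sources
    df_n = df[df.target == 1].reset_index(drop=True)

    # data augmentation loop: pick rows at random, with replacement
    for i in tqdm(np.random.randint(0, len(df_n), samples)):
        text = df_n.iloc[i]['text']
        augmented_text = aug_w2v.augment(text)
        # some nlpaug versions return a list of strings instead of a string
        if isinstance(augmented_text, list):
            augmented_text = augmented_text[0]
        new_text.append(augmented_text)

    # dataframe of augmented tweets, all labelled as disasters
    new = pd.DataFrame({'text': new_text, 'target': 1})
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    df = shuffle(pd.concat([df, new]).reset_index(drop=True))
    return df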

In [None]:
train = augment_text(train, samples=400)   ## change samples to 0 for no augmentation
# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
tweet = pd.concat([train, valid]).reset_index(drop=True)

In [None]:
df=pd.concat([tweet,test])

In [None]:
df

In [None]:
df.shape