## NLP Tutorial

NLP - or *Natural Language Processing* - is shorthand for a wide array of techniques designed to help machines learn from text. Natural Language Processing powers everything from chatbots to search engines, and is used in diverse tasks like sentiment analysis and machine translation.

In this tutorial we'll look at this competition's dataset, use a simple technique to process it, build a machine learning model, and submit predictions for a score!

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

In [2]:
train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

### A quick look at our data

Let's look at our data... first, an example of what is NOT a disaster tweet.

In [3]:
train_df[train_df["target"] == 0]["text"].values[1]

'I love fruits'

And one that is:

In [4]:
train_df[train_df["target"] == 1]["text"].values[1]

'Forest fire near La Ronge Sask. Canada'

### Building vectors

The theory behind the model we'll build in this notebook is pretty simple: the words contained in each tweet are a good indicator of whether it's about a real disaster or not (this is not entirely correct, but it's a great place to start).

We'll use scikit-learn's `CountVectorizer` to count the words in each tweet and turn them into data our machine learning model can process.

Note: a `vector` is, in this context, a set of numbers that a machine learning model can work with. We'll look at one in just a second.

In [5]:
count_vectorizer = feature_extraction.text.CountVectorizer()

## let's get counts for the first 5 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])

In [6]:
## we use .todense() here because these vectors are "sparse" (only non-zero elements are kept to save space)
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())

(1, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]


The above tells us that:
1. There are 54 unique words (or "tokens") in the first five tweets.
2. The first tweet contains only some of those unique tokens - all of the non-zero counts above are the tokens that DO exist in the first tweet.
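
The two points above can be reproduced by hand. The sketch below is plain Python rather than scikit-learn, and the three toy tweets and the `tokenize` helper are assumptions made for illustration. `CountVectorizer` applies its own tokenization rules, so counts on real data can differ.

```python
import re
from collections import Counter

# Toy corpus (hypothetical tweets, not taken from the competition data)
tweets = [
    "Forest fire near the lake",
    "I love the lake",
    "Fire fire everywhere",
]

def tokenize(text):
    # Lowercase and keep runs of 2+ word characters, similar in spirit
    # to CountVectorizer's default token pattern
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary is the sorted set of unique tokens across all tweets
vocabulary = sorted({token for tweet in tweets for token in tokenize(tweet)})

def count_vector(text):
    # One slot per vocabulary token, holding that token's count in the text
    counts = Counter(tokenize(text))
    return [counts[token] for token in vocabulary]

print(vocabulary)
print(count_vector(tweets[0]))
print(count_vector(tweets[2]))
```

Each vector has one entry per vocabulary token, which is why the first tweet's vector above has 54 entries even though the tweet itself is short.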

Now let's create vectors for all of our tweets.

In [7]:
train_vectors = count_vectorizer.fit_transform(train_df["text"])

## note that we're NOT using .fit_transform() here. Using just .transform() makes sure
# that the tokens in the train vectors are the only ones mapped to the test vectors - 
# i.e. that the train and test vectors use the same set of tokens.
test_vectors = count_vectorizer.transform(test_df["text"])
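
To see why only `.transform()` is used on the test set, here is a minimal plain-Python sketch of the fit/transform split. The helper names `fit_vocabulary` and `transform`, the whitespace tokenizer, and the toy texts are all assumptions for illustration, not scikit-learn's implementation.

```python
from collections import Counter

def fit_vocabulary(texts):
    # "fit": learn the token -> column mapping from the training texts only
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: col for col, tok in enumerate(tokens)}

def transform(texts, vocabulary):
    # "transform": count only tokens already in the vocabulary;
    # tokens never seen during fitting are dropped
    rows = []
    for text in texts:
        counts = Counter(text.lower().split())
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            if tok in vocabulary:
                row[vocabulary[tok]] = n
        rows.append(row)
    return rows

train_texts = ["fire in the forest", "the forest is calm"]  # hypothetical
test_texts = ["flood in the city"]                          # hypothetical

vocab = fit_vocabulary(train_texts)
train_rows = transform(train_texts, vocab)
test_rows = transform(test_texts, vocab)

print(vocab)
print(test_rows)
```

The test row has the same width as the training rows, and the unseen words "flood" and "city" contribute nothing. Refitting on the test texts would instead produce a different column layout that the trained model could not interpret.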

### Our model

As we mentioned above, we think the words contained in each tweet are a good indicator of whether it's about a real disaster or not. The presence of a particular word (or set of words) in a tweet might link directly to whether or not that tweet describes a real disaster.

What we're assuming here is a _linear_ connection. So let's build a linear model and see!

In [8]:
## Our vectors are really big, so we want to push our model's weights
## toward 0 without completely discounting different words - ridge regression 
## is a good way to do this.
clf = linear_model.RidgeClassifier()
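
To make "push our model's weights toward 0" concrete, here is a small sketch for the simplest possible case: one feature and no intercept, where ridge regression has the closed form w = Σxy / (Σx² + α). The toy data and the `ridge_weight` helper are assumptions for illustration; α plays the role of the `alpha` parameter of `RidgeClassifier`.

```python
def ridge_weight(xs, ys, alpha):
    # Closed-form ridge solution for one feature and no intercept:
    # w = sum(x*y) / (sum(x*x) + alpha)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

# Toy data (hypothetical): counts of one word, and labels coded as -1 / +1
xs = [0, 1, 2, 1]
ys = [-1, 1, 1, 1]

for alpha in (0.0, 1.0, 10.0):
    print(alpha, ridge_weight(xs, ys, alpha))
```

A larger `alpha` shrinks the weight toward 0 but never sets it exactly to 0, which matches the "without completely discounting different words" remark in the comment above.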

Let's test our model and see how well it does on the training data. For this we'll use `cross-validation`, where we train on a portion of the known data, then validate on the rest. If we do this several times (with different portions), we get a good idea of how a particular model or method performs.

The metric for this competition is F1, so let's use that here.
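
For reference, F1 is the harmonic mean of precision and recall, where TP, FP and FN count the true positives, false positives and false negatives for the positive class (here, disaster tweets):

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$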

In [9]:
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

array([0.60387232, 0.57484457, 0.64485082])

The above scores aren't terrible! They average about 0.61, so our assumption should score somewhere around 0.6 on the leaderboard. There are lots of ways to potentially improve on this (TF-IDF, LSA, LSTMs / RNNs, the list is long!), so give any of them a shot!
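
A single summary number is easier to compare across experiments than three separate fold scores. A minimal sketch using only the standard library, with the fold scores copied from the output above:

```python
from statistics import mean, stdev

# Fold scores copied from the cross_val_score output above
fold_scores = [0.60387232, 0.57484457, 0.64485082]

print(f"mean F1: {mean(fold_scores):.3f}")
print(f"sample std: {stdev(fold_scores):.3f}")
```

The spread between folds is a reminder that a difference of a point or two between two models may just be noise.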

In the meantime, let's do predictions on our test set and build a submission for the competition.

In [10]:
clf.fit(train_vectors, train_df["target"])

RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
                max_iter=None, normalize=False, random_state=None,
                solver='auto', tol=0.001)

In [11]:
sample_submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")

In [12]:
sample_submission["target"] = clf.predict(test_vectors)

In [13]:
sample_submission.head()

   id  target
0   0       0
1   2       1
2   3       1
3   9       0
4  11       1


In [14]:
sample_submission.to_csv("submission.csv", index=False)

In [15]:
print(train_df.columns)    
print('---')
# "0 in series" would test the index labels, so check the values instead
if 0 in train_df['target'].values:
    print('true')
    
for i in range(0,len(train_df['target'])):
    if train_df['target'][i] == 0:
        print(f'index: {i} is 0')
    if i == 20:  # compare ints with ==; "is" tests object identity
        break
print('---')        
for i in range(0, len(train_df['location'])):
    if isinstance(train_df['location'][i], (list, str)):
        print(train_df['location'][i])
    if i == 40:  # compare ints with ==; "is" tests object identity
        break
print('---')
size=len(train_df['id'])

count_with_location=0
count_without_location=0
for i in range(0, size):
    # Missing locations are read as float NaN, so a str check means "has a location"
    has_location = isinstance(train_df['location'][i], str)
    if has_location and train_df['target'][i] == 1:
        count_with_location+=1
    elif not has_location and train_df['target'][i] == 1:
        count_without_location+=1
print(count_with_location) # 2196
print(count_without_location) # 1075
print('---')
count_with_keyword=0
count_without_keyword=0
for i in range(0, size):
    # Missing keywords are read as float NaN, so a str check means "has a keyword"
    has_keyword = isinstance(train_df['keyword'][i], str)
    if train_df['target'][i]==1 and has_keyword:
        count_with_keyword+=1
    elif train_df['target'][i]==1 and not has_keyword:
        count_without_keyword+=1
print(count_with_keyword, count_without_keyword) # 3229, 42

Index(['id', 'keyword', 'location', 'text', 'target'], dtype='object')
---
true
index: 15 is 0
index: 16 is 0
index: 17 is 0
index: 18 is 0
index: 19 is 0
index: 20 is 0
---
Birmingham
Est. September 2012 - Bristol
AFRICA
Philadelphia, PA
London, UK
Pretoria
World Wide!!
Paranaque City
Live On Webcam
---
2196
1075
---
3229 42


In [16]:
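# Restrict to the numeric columns; newer pandas versions raise an error on text columns
correlation_matrix=train_df[['id', 'target']].corr(method='spearman')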

In [17]:
print(correlation_matrix)

              id    target
id      1.000000  0.060789
target  0.060789  1.000000


In [18]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df = pd.DataFrame(train_df)

In [19]:
df['keyword'] = df['keyword'].fillna('missing').astype(str)

In [20]:
df['keyword_encoded'] = label_encoder.fit_transform(df['keyword'])

correlation_matrix = df[['target', 'keyword_encoded']].corr(method='pearson')
print(correlation_matrix)

                   target  keyword_encoded
target           1.000000         0.057669
keyword_encoded  0.057669         1.000000


In [26]:
a = [2, 3, 10]
print(np.arange(len(a)))

[0 1 2]


Now, in the viewer, you can submit the above file to the competition! Good luck!

In [31]:
import numpy as np

class MyKFold:
    def __init__(self, n_splits=5, shuffle=False, random_state=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state
        
    def split(self, X):
        n_samples = len(X)
        indices = np.arange(n_samples)
        
        if self.shuffle:
            np.random.seed(self.random_state)
            np.random.shuffle(indices)
        
        fold_sizes = np.full(self.n_splits, n_samples // self.n_splits, dtype=int)
        # np.full(shape, fill_value, dtype=None) creates an array of the given shape and dtype with every element set to fill_value
        fold_sizes[:n_samples % self.n_splits] += 1
        # the first (n_samples % n_splits) folds take one extra sample each
        
        current = 0
        for fold_size in fold_sizes:
            # fold_sizes holds the number of samples per fold; fold_size is the size of the current fold
            start, stop = current, current + fold_size
            # start is the first index of the current fold, stop is its (exclusive) end index
            test_indices = indices[start:stop]
            # slice the current fold out of indices and use it as the test set;
            # e.g. with indices [0, 1, 2, ..., 99], start 0 and stop 20, test_indices is [0, 1, 2, ..., 19]
            train_indices = np.concatenate([indices[:start], indices[stop:]])
            # join everything outside the current fold and use it as the training set;
            # e.g. with start 0 and stop 20, train_indices is [20, 21, 22, ..., 99]
            yield train_indices, test_indices
            # yield hands back this fold's train and test indices and pauses the function;
            # the next iteration resumes from the saved state
            current = stop
            # the end of this fold becomes the start of the next one
            
# Usage example
X = np.arange(20)  # example data
kf = MyKFold(n_splits=5, shuffle=True, random_state=42)

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

TRAIN: [ 8  5 11  3 18 16 13  2  9 19  4 12  7 10 14  6] TEST: [ 0 17 15  1]
TRAIN: [ 0 17 15  1 18 16 13  2  9 19  4 12  7 10 14  6] TEST: [ 8  5 11  3]
TRAIN: [ 0 17 15  1  8  5 11  3  9 19  4 12  7 10 14  6] TEST: [18 16 13  2]
TRAIN: [ 0 17 15  1  8  5 11  3 18 16 13  2  7 10 14  6] TEST: [ 9 19  4 12]
TRAIN: [ 0 17 15  1  8  5 11  3 18 16 13  2  9 19  4 12] TEST: [ 7 10 14  6]
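
The example above divides evenly (20 samples, 5 folds of 4). The fold-size arithmetic in `MyKFold.split` also handles the uneven case; the sketch below isolates that step in plain Python, with `fold_sizes` as a hypothetical helper name.

```python
def fold_sizes(n_samples, n_splits):
    # Every fold gets the floor share; the first (n_samples % n_splits)
    # folds get one extra sample so the sizes sum to n_samples
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

print(fold_sizes(20, 5))  # the even case from the example above
print(fold_sizes(22, 5))  # an uneven case
```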


In [69]:
def create_folds(X, y, k):
    kf = MyKFold(n_splits=k, shuffle=True, random_state=42)
    folds = list(kf.split(X))
    return folds

In [33]:
def precision_recall_f1(y_true, y_pred):
    # TP: predicted positive and actually positive
    tp = sum((y_pred == 1) & (y_true == 1))
    # FP: predicted positive but actually negative
    fp = sum((y_pred == 1) & (y_true == 0))
    # FN: predicted negative but actually positive
    fn = sum((y_pred == 0) & (y_true == 1))
    
    # Precision
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    
    # Recall
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    
    # F1 score
    f1_score = 2 * ((precision*recall) / (precision+recall)) if (precision + recall) > 0 else 0
    
    return precision, recall, f1_score

# Example data
y_true = np.array([0, 1, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 0, 1, 1, 0])

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"Precision: {precision}, Recall: {recall}, F1 Score: {f1}")

Precision: 0.8, Recall: 0.8, F1 Score: 0.8000000000000002
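
The 0.8 values can be checked by counting the four outcome types directly. A short sketch on the same example labels, written with plain lists and `collections.Counter` instead of NumPy:

```python
from collections import Counter

# Same example labels as above, as plain lists
true_labels = [0, 1, 1, 1, 0, 1, 0, 0, 1, 0]
pred_labels = [0, 1, 0, 1, 0, 1, 0, 1, 1, 0]

# Count each (true label, predicted label) pair
pairs = Counter(zip(true_labels, pred_labels))
tp, fp, fn, tn = pairs[(1, 1)], pairs[(0, 1)], pairs[(1, 0)], pairs[(0, 0)]
print(tp, fp, fn, tn)

print(tp / (tp + fp))  # precision
print(tp / (tp + fn))  # recall
```

With 4 true positives, 1 false positive and 1 false negative, precision and recall are both 4/5.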


In [34]:
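from sklearn.metrics import f1_score  # was used below without being imported

def train_and_evaluate(clf, X, y, folds):
    # Positional NumPy indexing also works when y is a Series with a non-default index
    y = np.asarray(y)
    scores = []
    for train_index, test_index in folds:
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        
        score = f1_score(y_test, y_pred)
        scores.append(score)
        
    return scores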

In [35]:
def cross_val_score_manual(clf, X, y, k):
    folds = create_folds(X, y, k)
    scores = train_and_evaluate(clf, X, y, folds)
    return np.mean(scores), np.std(scores)

In [37]:
def kfold_cross_validation(model, data, labels, k):
    kf = MyKFold(n_splits = k, shuffle = True, random_state = 42)
    labels = np.asarray(labels)  # positional indexing, independent of any pandas index
    scores = []
    
    for train_index, test_index in kf.split(data):
        X_train, X_test = data[train_index], data[test_index]
        y_train, y_test = labels[train_index], labels[test_index]
        
        model.fit(X_train, y_train)
        
        predictions = model.predict(X_test)
        # precision_recall_f1 returns a (precision, recall, f1) tuple; keep only F1
        # so the mean below is a mean F1 rather than a mix of three metrics
        _, _, f1 = precision_recall_f1(y_test, predictions)
        scores.append(f1)
        
    mean_score = np.mean(scores)
    return mean_score

In [38]:
train_df.isnull().sum()

id                    0
keyword               0
location           2533
text                  0
target                0
keyword_encoded       0
dtype: int64

In [42]:
df=train_df.dropna().copy()  # explicit copy so adding columns later does not trigger SettingWithCopyWarning

In [44]:
df.isnull().sum()

id                 0
keyword            0
location           0
text               0
target             0
keyword_encoded    0
dtype: int64

In [73]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\d', ' ', text)
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

df['cleaned_text'] = df['text'].apply(preprocess_text)
df['tokens'] = df['cleaned_text'].apply(word_tokenize)

# Debug check: inspect the results
print(df[['text', 'cleaned_text', 'tokens']].head())



                                                 text  \
31  @bbcmtd Wholesale Markets ablaze http://t.co/l...   
32  We always try to bring the heavy. #metal #RT h...   
33  #AFRICANBAZE: Breaking news:Nigeria flag set a...   
34                 Crying out for more! Set me ablaze   
35  On plus side LOOK AT THE SKY LAST NIGHT IT WAS...   

                                         cleaned_text  \
31  bbcmtd wholesale markets ablaze http co lhyxeo...   
32  always try bring heavy metal rt http co yao e ...   
33  africanbaze breaking news nigeria flag set abl...   
34                                  crying set ablaze   
35  plus side look sky last night ablaze http co q...   

                                               tokens  
31  [bbcmtd, wholesale, markets, ablaze, http, co,...  
32  [always, try, bring, heavy, metal, rt, http, c...  
33  [africanbaze, breaking, news, nigeria, flag, s...  
34                              [crying, set, ablaze]  
35  [plus, side, look, sky, last, 
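
To see what each line of `preprocess_text` contributes, the sketch below applies the same regex steps one at a time to a single made-up tweet. The sample text, the `preprocess_steps` helper, and the tiny `toy_stop_words` set (a stand-in for NLTK's much longer English list) are assumptions for illustration.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption for illustration)
toy_stop_words = {"the", "is", "at", "in", "a", "of", "and", "to"}

def preprocess_steps(text):
    steps = [text.lower()]                        # 1. lowercase
    steps.append(re.sub(r"\W", " ", steps[-1]))   # 2. non-word characters -> space
    steps.append(re.sub(r"\s+", " ", steps[-1]))  # 3. collapse runs of whitespace
    steps.append(re.sub(r"\d", " ", steps[-1]))   # 4. digits -> space
    kept = [w for w in steps[-1].split() if w not in toy_stop_words]
    steps.append(" ".join(kept))                  # 5. drop stopwords, rejoin
    return steps

for step in preprocess_steps("The fire is spreading at 5pm in #California!!"):
    print(repr(step))
```

As the `cleaned_text` column above shows, URL fragments such as `http` and `co` survive this cleaning, so stripping URLs before these steps may be worth trying.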


In [74]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(df['cleaned_text']).toarray()

# Debug check: inspect the TF-IDF result
print(X.shape)


(5080, 5000)
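
Each of the 5000 columns holds a TF-IDF weight rather than a raw count. The sketch below computes such weights by hand in plain Python, assuming scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`). The toy documents and the names `doc_freq`, `idf` and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

docs = [  # hypothetical cleaned tweets
    "fire forest fire",
    "forest calm",
    "flood city",
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(docs)

# Document frequency: number of documents containing each token
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

def idf(token):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[token])) + 1

def tfidf_vector(tokens):
    counts = Counter(tokens)
    raw = [counts[tok] * idf(tok) for tok in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]  # L2-normalize the row

for tok in vocabulary:
    print(tok, round(idf(tok), 3))
print([round(v, 3) for v in tfidf_vector(tokenized[0])])
```

Tokens that appear in many documents get a lower IDF, so common words carry less weight than they would with plain counts.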


In [75]:
from sklearn.model_selection import train_test_split

y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Debug check: inspect the train/test split
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)


(4064, 5000) (1016, 5000)
(4064,) (1016,)


In [76]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)




LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [77]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.78      0.89      0.83       580
           1       0.82      0.66      0.73       436

    accuracy                           0.79      1016
   macro avg       0.80      0.77      0.78      1016
weighted avg       0.79      0.79      0.79      1016
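
The last two rows of the report average the per-class scores in two different ways. A small sketch reproducing the F1 column from the rounded per-class values above (so the results are approximate); the variable names are hypothetical:

```python
# Per-class F1 and support copied from the report above (rounded values)
class_f1 = {0: 0.83, 1: 0.73}
support = {0: 580, 1: 436}

# Macro average: every class counts equally
macro_f1 = sum(class_f1.values()) / len(class_f1)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(class_f1[c] * support[c] for c in class_f1) / sum(support.values())

print(f"macro F1: {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
```

Note that the competition metric corresponds to the F1 of class 1 alone (0.73 here), not to either average.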



In [78]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f"Cross-Validation F1 Scores: {scores}")
print(f"Mean F1 Score: {scores.mean()}")




Cross-Validation F1 Scores: [0.56278366 0.57065217 0.58366801 0.48909657 0.68157895]
Mean F1 Score: 0.5775558721928912


