1. Load the data from the CSV file
2. Check the label distribution

- import the *modules*

In [None]:
import pandas as pd
import os

- set the *data directory*

In [None]:
data_dir = '../data/'
air_dir = os.path.join(data_dir, 'airb')
air_csv = os.path.join(air_dir, 'AirBNBReviews.csv')
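
Because the path is relative, loading fails with a fairly terse message when the notebook is started from a different working directory. The sketch below shows one way to fail early with a readable error. `require_file` is a hypothetical helper introduced here for illustration, not part of the original notebook.

```python
import os

# Same path layout as in the cell above.
data_dir = "../data/"
air_csv = os.path.join(data_dir, "airb", "AirBNBReviews.csv")


def require_file(path):
    """Return the path unchanged, or raise a readable error if the file is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"Expected the review CSV at {os.path.abspath(path)}"
        )
    return path


# Possible usage before loading (commented out so the sketch runs anywhere):
# airb_df = pd.read_csv(require_file(air_csv))
```

Printing the absolute path in the error makes it easier to see which directory the notebook was actually started from.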

### 1. Load the data from the CSV file

- use the `pd.read_csv()` function

In [None]:

airb_df = pd.read_csv(air_csv)

print(airb_df)
airb_df.info()  # info() prints its summary itself and returns None

                 Genre                                             Review  \
0             Location  The location of this Airbnb was perfect, close...   
1          Cleanliness  The cleanliness of the Airbnb was outstanding,...   
2         Neighborhood  The neighborhood where this Airbnb is situated...   
3             Security  I felt completely safe and secure during my st...   
4     Pet Friendliness  They were so welcoming to my pet, it felt like...   
..                 ...                                                ...   
349               Host  Unfortunately, the host was unresponsive and l...   
350               Host  I experienced difficulties in reaching the hos...   
351               Host  The host was unaccommodating and did not adequ...   
352               Host  I felt unwelcome by the host, with minimal com...   
353               Host  The host was unhelpful and showed a lack of in...   

     Positive or Negative  
0                     1.0  
1                  

### 2. Check the label distribution

- use the `.groupby()` method

In [None]:
label_distribution = airb_df.groupby(['Genre', 'Positive or Negative']).size()
data_distribution = airb_df.value_counts()

print(label_distribution)
print('value_counts:', data_distribution)


Genre              Positive or Negative
 Cleanliness       0.0                     39
                   1.0                     19
 Host              0.0                     38
                   1.0                     18
 Location          0.0                     38
                   1.0                     18
 Neighborhood      0.0                     38
                   1.0                     19
 Pet Friendliness  0.0                     38
                   1.0                     19
 Security          0.0                     38
                   1.0                     19
dtype: int64
value_counts: Genre              Review                                                                                                                Positive or Negative
 Neighborhood      There were safety concerns in the neighborhood, making me feel uneasy during my stay.                                 0.0                     5
 Pet Friendliness  Unfortunately, the Airbnb did not allow p
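
The counts above show roughly twice as many `0.0` labels as `1.0` labels in every genre, and the genre names appear to carry a leading space. The following is a minimal sketch, on a small made-up frame that reuses the real column names, of how the whitespace could be stripped and the positive share per genre computed. The rows in `toy_df` are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for airb_df: real column names, made-up rows.
toy_df = pd.DataFrame({
    "Genre": [" Host", " Host", " Host", " Location", " Location", " Location"],
    "Positive or Negative": [0.0, 0.0, 1.0, 0.0, 1.0, 1.0],
})

# Remove surrounding whitespace from the genre names.
toy_df["Genre"] = toy_df["Genre"].str.strip()

# Rows = genre, columns = label, cells = number of reviews.
label_table = pd.crosstab(toy_df["Genre"], toy_df["Positive or Negative"])

# The mean of a 0/1 label is the share of positive reviews.
positive_share = toy_df.groupby("Genre")["Positive or Negative"].mean()

print(label_table)
print(positive_share)
```

A cross-tabulation puts both labels of a genre on one row, which makes an imbalance easier to spot than the stacked `groupby().size()` output.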

### 3. Check the missing values

- use the `.isnull()` method

In [None]:

missing_value = airb_df.isnull().sum()

print(missing_value)

label_missing_distribution = missing_value.value_counts()
print('value_counts:', label_missing_distribution)



Genre                   13
Review                  13
Positive or Negative    13
dtype: int64
value_counts: 13    3
dtype: int64
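
Every column reports the same number of missing values, which suggests (but does not prove) that the same rows are empty across all columns. The sketch below, on a made-up three-row frame, shows how that could be checked and how such rows could be dropped. `toy_df` and `clean_df` are illustrative names, and whether the rows should be dropped at all is an assumption about the intended preprocessing.

```python
import numpy as np
import pandas as pd

# Toy stand-in for airb_df with one completely empty row.
toy_df = pd.DataFrame({
    "Genre": ["Host", np.nan, "Location"],
    "Review": ["Great host.", np.nan, "Far from everything."],
    "Positive or Negative": [1.0, np.nan, 0.0],
})

# True for rows in which every column is missing.
fully_empty = toy_df.isnull().all(axis=1)
print("fully empty rows:", int(fully_empty.sum()))

# Drop only the rows that are empty in all columns, then renumber the index.
clean_df = toy_df.dropna(how="all").reset_index(drop=True)

# Without NaN the label can be stored as an integer.
clean_df["Positive or Negative"] = clean_df["Positive or Negative"].astype(int)
print(clean_df)
```

Note that `reset_index(drop=True)` renumbers the rows, so any later lookup by row label would point at a different review than before.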


### 4. Convert Text to Tokens

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")


In [None]:
text_128 = airb_df['Review'][127]  # row label 127, i.e. the 128th row
text_255 = airb_df['Review'][255]  # row label 255, i.e. the 256th row despite the name
text_custom_1 = "It was beautiful! I want to go again." # Change the text_custom_1 to your own text
text_custom_2 = "Wow! this is terrible and very noisy!" # Change the text_custom_2 to your own text

In [None]:
print(text_128)
print(text_255)
print(text_custom_1)
print(text_custom_2)

I didn't feel comfortable leaving my belongings unattended in the Airbnb due to inadequate security. 
The neighborhood had a reputation for thefts, affecting my peace of mind. 
It was beautiful! I want to go again.
Wow! this is terrible and very noisy!


#### 1. Encode The Review Text

In [None]:

# Encode each text; the tokenizer returns input_ids, token_type_ids and attention_mask.
tokenizer_input_128 = tokenizer(text_128)
tokenizer_input_255 = tokenizer(text_255)
tokenizer_input_c1 = tokenizer(text_custom_1)
tokenizer_input_c2 = tokenizer(text_custom_2)
tokenizer_inputs = [tokenizer_input_128, tokenizer_input_255, tokenizer_input_c1, tokenizer_input_c2]
print(tokenizer_input_128)
print(tokenizer_inputs)

# text_to_token = tokenizer.convert_ids_to_tokens(tokenizer_input_128["input_ids"])
# print(text_to_token)

{'input_ids': [101, 151, 30557, 112, 162, 23333, 66493, 13356, 20105, 11153, 49909, 10107, 10155, 23708, 20298, 10104, 10103, 11140, 10417, 55216, 10875, 10114, 20150, 10282, 61420, 15636, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[{'input_ids': [101, 151, 30557, 112, 162, 23333, 66493, 13356, 20105, 11153, 49909, 10107, 10155, 23708, 20298, 10104, 10103, 11140, 10417, 55216, 10875, 10114, 20150, 10282, 61420, 15636, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}, {'input_ids': [101, 10103, 32240, 10407, 143, 31514, 10139, 49601, 10107, 117, 56365, 10285, 11153, 16534, 10108, 15849, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [None]:

def tokenize_text(text):
    """Encode a string into input_ids, token_type_ids and attention_mask."""
    return tokenizer(text)


In [None]:
for text in [text_128, text_255, text_custom_1, text_custom_2]:

    print(text)
    text_to_token = tokenize_text(text)
    print(text_to_token)
    print('------------------')

I didn't feel comfortable leaving my belongings unattended in the Airbnb due to inadequate security. 
{'input_ids': [101, 151, 30557, 112, 162, 23333, 66493, 13356, 20105, 11153, 49909, 10107, 10155, 23708, 20298, 10104, 10103, 11140, 10417, 55216, 10875, 10114, 20150, 10282, 61420, 15636, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
------------------
The neighborhood had a reputation for thefts, affecting my peace of mind. 
{'input_ids': [101, 10103, 32240, 10407, 143, 31514, 10139, 49601, 10107, 117, 56365, 10285, 11153, 16534, 10108, 15849, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
------------------
It was beautiful! I want to go again.
{'input_ids': [101, 10197, 10140, 20524, 106, 151, 16612, 1
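
Each printed encoding has three parallel lists of equal length, and both complete encodings above start with id `101` and end with id `102`. The sketch below inspects that structure in plain Python, using the ids copied from the output for the second review. `summarize_encoding` is a hypothetical helper, and the all-zero and all-one lists are rebuilt from the length rather than copied.

```python
# input_ids copied from the printed encoding of the second review (text_255).
input_ids = [101, 10103, 32240, 10407, 143, 31514, 10139, 49601, 10107, 117,
             56365, 10285, 11153, 16534, 10108, 15849, 119, 102]

encoding = {
    "input_ids": input_ids,
    "token_type_ids": [0] * len(input_ids),  # single segment, all zeros in the output
    "attention_mask": [1] * len(input_ids),  # no padding, all ones in the output
}

# Ids that wrap the printed encodings; they decode to [CLS] and [SEP].
CLS_ID, SEP_ID = 101, 102


def summarize_encoding(enc):
    """Collect a few structural facts about one tokenizer output."""
    ids = enc["input_ids"]
    return {
        "length": len(ids),
        "same_length": len({len(v) for v in enc.values()}) == 1,
        "wrapped": ids[0] == CLS_ID and ids[-1] == SEP_ID,
        "content_tokens": len(ids) - 2,  # everything between the two special ids
        "padded_positions": enc["attention_mask"].count(0),
    }


summary = summarize_encoding(encoding)
print(summary)
```

The attention mask only becomes interesting once several texts are padded to a common length. For single, unpadded texts like these it contains only ones.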

#### 2. Decode the Tokens Back to Review Text

In [None]:

def decode_token(token):
    """Decode the input_ids of an encoding back into a string."""
    return tokenizer.decode(token["input_ids"])


In [None]:
for text in [text_128, text_255, text_custom_1, text_custom_2]:
    print(text)
    text_to_token = tokenize_text(text)
    print(text_to_token)
    token_to_text = decode_token(text_to_token)
    print(token_to_text)
    print('------------------')

I didn't feel comfortable leaving my belongings unattended in the Airbnb due to inadequate security. 
{'input_ids': [101, 151, 30557, 112, 162, 23333, 66493, 13356, 20105, 11153, 49909, 10107, 10155, 23708, 20298, 10104, 10103, 11140, 10417, 55216, 10875, 10114, 20150, 10282, 61420, 15636, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] i didn't feel comfortable leaving my belongings unattended in the airbnb due to inadequate security. [SEP]
------------------
The neighborhood had a reputation for thefts, affecting my peace of mind. 
{'input_ids': [101, 10103, 32240, 10407, 143, 31514, 10139, 49601, 10107, 117, 56365, 10285, 11153, 16534, 10108, 15849, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[CL
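
The decoded text is not identical to the input: the uncased tokenizer lowercases it and wraps it in `[CLS]` and `[SEP]`. The sketch below checks this on the first review, using the two strings copied from the output above. `strip_special_tokens` is a hypothetical helper written for this comparison. With the real tokenizer, passing `skip_special_tokens=True` to `tokenizer.decode` would be the usual way to omit the markers.

```python
# Original review and its decoded form, copied from the output above.
original = ("I didn't feel comfortable leaving my belongings unattended "
            "in the Airbnb due to inadequate security. ")
decoded = ("[CLS] i didn't feel comfortable leaving my belongings unattended "
           "in the airbnb due to inadequate security. [SEP]")


def strip_special_tokens(text, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Remove special-token markers and collapse the leftover whitespace."""
    for tok in special_tokens:
        text = text.replace(tok, "")
    return " ".join(text.split())


cleaned = strip_special_tokens(decoded)
print(cleaned)

# The round trip preserves the words but not the casing.
print(cleaned == original.strip().lower())
```

So the round trip is lossy with respect to casing, which is expected for an uncased model.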