# Reading the Data

We start by reading the data that will be used for question answering (Q&A). The Q&A intent is labeled other.

In [1]:
import numpy as np
import pandas as pd


In [2]:
qa_data = pd.read_csv("q&a_intent_train.csv", names=["target", "text"])

qa_data

Unnamed: 0,target,text
0,other,What is the principle behind flight?
1,other,What are the four forces acting on an airplane?
2,other,What is the difference between IFR and VFR?
3,other,What is a black box in aviation?
4,other,What is the busiest airport in the world by pa...
...,...,...
94,other,"What is ""ACARS""?"
95,other,"What is ""Alternate Airport""?"
96,other,"What does ""pan-pan"" mean?"
97,other,"What is ""decision height"" (DH)?"


Next, we read the training and test data provided by the ATIS dataset.

In [3]:
atis_train_data = pd.read_csv("atis_intents_train.csv", names=["target", "text"])
atis_test_data = pd.read_csv("atis_intents_test.csv", names=["target", "text"])

print("ATIS train dataset size is:", len(atis_train_data))
print("ATIS test dataset size is:", len(atis_test_data))

ATIS train dataset size is: 4834
ATIS test dataset size is: 800


We now build the training and test datasets from the data loaded so far. To do so, we split qa_data into training and test sets (holding out roughly 20% for testing) and then concatenate each part with the corresponding ATIS table, producing one combined training set and one combined test set.

In [4]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
qa_train_data, qa_test_data = train_test_split(qa_data, test_size=0.20, random_state=42)

print("Q&A train dataset size is:", len(qa_train_data))
print("Q&A test dataset size is:", len(qa_test_data))

Q&A train dataset size is: 79
Q&A test dataset size is: 20
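
The 79/20 split follows from how the partition sizes are derived when `test_size` is a fraction: to the best of our understanding, scikit-learn rounds the test count up and assigns the remaining rows to training. The sketch below redoes that arithmetic, assuming the 99 Q&A rows implied by the sizes printed above. The helper `split_sizes` is hypothetical and not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Hypothetical helper mirroring the fractional-size rule described above."""
    n_test = math.ceil(test_size * n_samples)  # test count is rounded up
    n_train = n_samples - n_test               # the remainder goes to training
    return n_train, n_test

n_train, n_test = split_sizes(99, 0.20)
print(n_train, n_test)
```

With so few Q&A rows, the held-out set for the other class is only 20 examples, so its per-class metrics should be read with some caution.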


In [5]:
train_data = pd.concat([qa_train_data, atis_train_data], ignore_index=True)

train_data

Unnamed: 0,target,text
0,other,"What is a ""taxiway""?"
1,other,"What is a ""slot-restricted"" airport?"
2,other,"What is ""NextGen"" in U.S. aviation?"
3,other,"What does the term ""gate hold"" mean in aviation?"
4,other,"What does ""direct flight"" mean as opposed to ""..."
...,...,...
4908,atis_airfare,what is the airfare for flights from denver t...
4909,atis_flight,do you have any flights from denver to baltim...
4910,atis_airline,which airlines fly into and out of denver
4911,atis_flight,does continental fly from boston to san franc...


In [6]:
test_data = pd.concat([qa_test_data, atis_test_data], ignore_index=True)

test_data

Unnamed: 0,target,text
0,other,"What is ""yaw"" in aviation?"
1,other,"What is a ""deadhead"" flight?"
2,other,"What is ""Alternate Airport""?"
3,other,What is the purpose of ailerons on an aircraft?
4,other,"What is ""decision height"" (DH)?"
...,...,...
815,atis_flight,please find all the flights from cincinnati t...
816,atis_flight,find me a flight from cincinnati to any airpo...
817,atis_flight,i'd like to fly from miami to chicago on amer...
818,atis_flight,i would like to book a round trip flight from...


# Preprocessing Data

In [7]:
import spacy

# Load spaCy's small English pipeline (tokenizer, tagger, parser, lemmatizer, NER)
nlp = spacy.load("en_core_web_sm")



In [8]:
def preprocess_text(doc):
    """
    Preprocess a single document.
    - `doc`: a string containing a document to preprocess.
    Returns a preprocessed version of the document.
    """
    # Parse the document with spaCy
    parsed_doc = nlp(doc)
    
    # Lemmatize and lowercase each token, dropping stopwords, punctuation, and whitespace
    processed_tokens = [
        token.lemma_.lower()
        for token in parsed_doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]
    
    # Rejoin processed tokens into a single string
    return ' '.join(processed_tokens)

In [9]:
train_data['processed_text'] = train_data['text'].apply(preprocess_text)
train_data.head()

Unnamed: 0,target,text,processed_text
0,other,"What is a ""taxiway""?",taxiway
1,other,"What is a ""slot-restricted"" airport?",slot restrict airport
2,other,"What is ""NextGen"" in U.S. aviation?",nextgen u.s. aviation
3,other,"What does the term ""gate hold"" mean in aviation?",term gate hold mean aviation
4,other,"What does ""direct flight"" mean as opposed to ""...",direct flight mean oppose non stop flight
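
To make the filtering step concrete without loading a spaCy model, here is a simplified stand-in. It is only an illustration: the tiny `TOY_STOPWORDS` set and the `toy_preprocess` function are hypothetical, the tokenizer is a plain regular expression, and no lemmatization is performed, so its output will differ from `preprocess_text` on many inputs.

```python
import re

# Hypothetical, hand-picked stopword list; spaCy's real list is much larger.
TOY_STOPWORDS = {"what", "is", "a", "an", "the", "in", "on", "of", "does"}

def toy_preprocess(text):
    """Lowercase, keep word-like tokens, and drop stopwords (no lemmatization)."""
    # The pattern keeps runs of letters and digits, discarding punctuation and spaces
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in TOY_STOPWORDS)

print(toy_preprocess('What is a "taxiway"?'))
print(toy_preprocess('What is "yaw" in aviation?'))
```

Note how aggressively stopword removal shortens these questions: the interrogative words that distinguish a Q&A request are removed, leaving mostly domain nouns.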


In [10]:
test_data['processed_text'] = test_data['text'].apply(preprocess_text)
test_data.head()

Unnamed: 0,target,text,processed_text
0,other,"What is ""yaw"" in aviation?",yaw aviation
1,other,"What is a ""deadhead"" flight?",deadhead flight
2,other,"What is ""Alternate Airport""?",alternate airport
3,other,What is the purpose of ailerons on an aircraft?,purpose aileron aircraft
4,other,"What is ""decision height"" (DH)?",decision height dh


# Bag of Words

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
# Learn the vocabulary from the training text only and vectorize it in a single step
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(train_data['processed_text'])

In [13]:
X_test_bow = vectorizer.transform(test_data['processed_text'])

In [14]:
X_train_bow

<4913x721 sparse matrix of type '<class 'numpy.int64'>'
	with 27309 stored elements in Compressed Sparse Row format>

In [15]:
X_test_bow

<820x721 sparse matrix of type '<class 'numpy.int64'>'
	with 4481 stored elements in Compressed Sparse Row format>
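
The test matrix has the same 721 columns as the training matrix because `transform` reuses the vocabulary learned during fitting. The sketch below reproduces that idea in plain Python on made-up sentences. `fit_vocabulary` and `to_counts` are hypothetical helpers for illustration, not the `CountVectorizer` implementation, which among other differences returns sparse matrices and applies its own tokenization.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct whitespace-separated token to a column index."""
    vocab = sorted({token for doc in docs for token in doc.split()})
    return {token: idx for idx, token in enumerate(vocab)}

def to_counts(docs, vocab):
    """Turn each document into a dense count vector over the fitted vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        # Tokens absent from the vocabulary have no column, so they are dropped
        rows.append([counts.get(token, 0) for token in vocab])
    return rows

train_docs = ["flight denver boston", "airfare flight denver"]  # made-up examples
test_docs = ["flight flight miami"]                             # "miami" is unseen

vocab = fit_vocabulary(train_docs)
print(vocab)
print(to_counts(train_docs, vocab))
print(to_counts(test_docs, vocab))
```

Because unseen tokens are dropped, a test question made up entirely of words missing from the training vocabulary becomes an all-zero row, which gives the classifier nothing to work with.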

# SVM

## Define and Train the SVM Classifier

In [16]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# Initialize the SVM classifier with a linear kernel
svm_classifier = SVC(kernel='linear', random_state=42)

# Train the classifier using the BoW features from the training data
svm_classifier.fit(X_train_bow, train_data['target'])

## Predict on the Test Data and Evaluate the Model

In [17]:
# Predict the target values for the test data
y_test_pred = svm_classifier.predict(X_test_bow)

# Evaluate the predictions against the actual target values from the test data
print("Accuracy on test data:", accuracy_score(test_data['target'], y_test_pred))
print("\nClassification Report on test data:\n", classification_report(test_data['target'], y_test_pred))


Accuracy on test data: 0.9585365853658536

Classification Report on test data:
                      precision    recall  f1-score   support

  atis_abbreviation       0.93      0.79      0.85        33
      atis_aircraft       0.60      1.00      0.75         9
       atis_airfare       0.94      0.94      0.94        48
       atis_airline       0.86      0.95      0.90        38
        atis_flight       0.99      0.98      0.98       632
   atis_flight_time       0.50      1.00      0.67         1
atis_ground_service       1.00      1.00      1.00        36
      atis_quantity       0.00      0.00      0.00         3
              other       0.67      0.70      0.68        20

           accuracy                           0.96       820
          macro avg       0.72      0.82      0.75       820
       weighted avg       0.96      0.96      0.96       820
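
The gap between the macro average (0.75 F1) and the weighted average (0.96 F1) comes from class imbalance. The macro average treats all nine classes equally, while the weighted average scales each class by its support, so atis_flight (632 of 820 examples) dominates. The sketch below recomputes both figures from the rounded per-class values printed above. Because the inputs are already rounded to two decimals, the results are approximate.

```python
# (f1-score, support) per class, copied from the classification report above
report = {
    "atis_abbreviation": (0.85, 33),
    "atis_aircraft": (0.75, 9),
    "atis_airfare": (0.94, 48),
    "atis_airline": (0.90, 38),
    "atis_flight": (0.98, 632),
    "atis_flight_time": (0.67, 1),
    "atis_ground_service": (1.00, 36),
    "atis_quantity": (0.00, 3),
    "other": (0.68, 20),
}

total_support = sum(support for _, support in report.values())

# Macro average: unweighted mean over classes
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class contributes in proportion to its support
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(total_support, round(macro_f1, 2), round(weighted_f1, 2))
```

For this task the macro average and the per-class rows are the more informative numbers: atis_quantity is never predicted correctly, and the other class reaches only 0.68 F1, even though overall accuracy is about 0.96.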





In [18]:
def predict_intention(text):
    # Apply the same spaCy preprocessing that was used on the training data
    preprocessed_text = preprocess_text(text)
    
    # Transform the preprocessed text into BoW format using the same vectorizer
    text_bow = vectorizer.transform([preprocessed_text])
    
    # Predict the intention using the trained SVM classifier
    predicted_intention = svm_classifier.predict(text_bow)
    
    # Return the predicted intention
    return predicted_intention[0]

# Example usage
example_text = "What are the four forces acting on an airplane?"
predicted_intention = predict_intention(example_text)
print(f"The predicted intention for '{example_text}' is '{predicted_intention}'.")


The predicted intention for 'What are the four forces acting on an airplane?' is 'other'.
