**Project: Applying NLP in Requirements Engineering: Exploring Data Management Challenges in Azure Data Factory**

Source data: Stack Overflow questions and answers concerning Azure Data Factory

Tags: azure-data-factory, adf 

Title keywords: azure data factory, adf

Link to data source: https://data.stackexchange.com/stackoverflow/query/new

Link to generated schema: https://dbdiagram.io/d/6448f5b26b3194705139098b

Link to GitHub Repo: https://github.com/tanwolf/NLP_Requirements-Engineering


**Purpose of this notebook:**
- Train a Support Vector Machine (SVM) model to identify questions belonging to topic 0, "Data Management in Pipelines". The topic was identified through Latent Dirichlet Allocation (LDA).

In [1]:
import pandas as pd

df = pd.read_csv('merged_adf_df_with_lda_topic1.csv', index_col=0)

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['LemmatizedQuestionBody']).toarray()
Y = df['LDATopic']

In [4]:
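from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Assumption: the split cell (In [3]) is missing from this export. An 80/20 split
# matches the reported supports (1710 test rows, 6837 training rows); the original
# random seed is unknown, so the exact numbers below may not reproduce.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# Training the model
svm_model = svm.SVC()
svm_model.fit(X_train, Y_train)

# Evaluating the model
Y_pred_svm = svm_model.predict(X_test)
print(classification_report(Y_test, Y_pred_svm))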

              precision    recall  f1-score   support

           0       0.82      0.61      0.70       412
          11       0.89      0.96      0.92      1298

    accuracy                           0.87      1710
   macro avg       0.85      0.78      0.81      1710
weighted avg       0.87      0.87      0.87      1710



In [5]:
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

# Print confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(Y_test, Y_pred_svm))

Confusion Matrix:
[[ 251  161]
 [  54 1244]]
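
The per-class figures in the classification report can be recomputed from this matrix by hand. The following is a minimal sketch, assuming scikit-learn's convention that rows are true labels and columns are predicted labels, in the label order [0, 11]. The helper name `per_class_metrics` is made up for illustration.

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels, in the order [0, 11].
labels = [0, 11]
cm = [[251, 161],
      [54, 1244]]


def per_class_metrics(cm, labels):
    """Return {label: (precision, recall, f1)} for a square confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = cm[i][i]
        predicted_as_label = sum(row[i] for row in cm)  # column sum
        actually_label = sum(cm[i])                     # row sum
        precision = true_positives / predicted_as_label
        recall = true_positives / actually_label
        f1 = 2 * precision * recall / (precision + recall)
        metrics[label] = (precision, recall, f1)
    return metrics


metrics = per_class_metrics(cm, labels)
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

for label, (p, r, f1) in metrics.items():
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The 161 class 0 questions that were predicted as class 11 are what pull the class 0 recall down to 0.61.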


In [6]:
# Perform cross-validation
scores = cross_val_score(svm_model, X, Y, cv=5)
print("Cross-validation scores: ", scores)
print("Average cross-validation score: ", scores.mean())

Cross-validation scores:  [0.83508772 0.82105263 0.84259801 0.79169105 0.83382095]
Average cross-validation score:  0.8248500713457136
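
Note that `X` was vectorized on the full corpus before cross-validation, so the IDF weights in each fold have already seen that fold's held-out questions. One possible way to avoid this, sketched here under assumptions, is to wrap the vectorizer and the classifier in a pipeline so that TF-IDF is refit on the training part of every fold. The tiny corpus and its labels are invented stand-ins for `df['LemmatizedQuestionBody']` and `df['LDATopic']`, and `cv=2` is used only because the toy corpus is so small.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus (assumption); not taken from the Stack Overflow data.
texts = [
    "copy activity fail sink dataset",
    "pipeline copy data blob storage",
    "copy csv file data lake sink",
    "dataset sink copy activity mapping",
    "trigger schedule pipeline run",
    "integration runtime connection error",
    "linked service authentication key vault",
    "self hosted integration runtime install",
]
labels = [0, 0, 0, 0, 11, 11, 11, 11]

# The vectorizer is refit on the training part of every fold.
text_svm = make_pipeline(TfidfVectorizer(max_features=5000), svm.SVC())
scores = cross_val_score(text_svm, texts, labels, cv=2)
print("Cross-validation scores:", scores)
```

Whether this changes the scores noticeably on the real data has not been checked here.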


In [7]:
# Inspect support vectors
print("Support vectors:")
print(svm_model.support_vectors_)

Support vectors:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [8]:
# Inspect coefficients for support vectors
print("Coefficients for support vectors:")
print(svm_model.dual_coef_)

Coefficients for support vectors:
[[-0.94367023 -1.         -0.5883715  ...  0.89119695  1.
   0.44994965]]


***GPT4 prompt:*** Summarize and analyze the results:

***GPT4 result:*** The SVM model is used to classify the questions into two categories, class 0 and class 11.

- **Classification Report:** The model has an overall accuracy of 87% on the test set. The precision, recall and F1-score for class 0 are 82%, 61% and 70% respectively. For class 11, these metrics are 89%, 96% and 92%. The model therefore performs better on class 11 than on class 0. Recall for class 11 is high, meaning the model correctly identifies most class 11 instances. Recall for class 0 is much lower, meaning the model misses about 39% of the class 0 instances.

- **Confusion Matrix:** The confusion matrix provides the same information as raw counts. The model correctly identified 251 instances of class 0, but misclassified 161 as class 11. Similarly, the model correctly identified 1244 instances of class 11, but misclassified 54 as class 0. This again shows that the model performs better on class 11 than on class 0.

- **Cross-validation scores:** The cross-validation scores are fairly consistent, ranging from about 0.79 to 0.84, with an average of 0.82, somewhat below the 0.87 accuracy of the single test split. This suggests that the model's performance is robust and not overly dependent on the particular division of the data into training and test sets.

- **Support vectors:** The support vectors are the training points that lie on or inside the margin around the decision boundary. These are the points that are hardest to classify, and they alone determine the position of the decision boundary. Because TF-IDF features are sparse, most entries are zero, which is why the truncated printout shows only zeros.

- **Coefficients for support vectors:** The values in `dual_coef_` are the dual coefficients of the support vectors, that is, each support vector's Lagrange multiplier multiplied by the sign of its class. The sign indicates which class the support vector belongs to (negative for class 0 and positive for class 11 in this output), and the magnitude indicates how strongly that vector contributes to the decision function. The magnitude is capped at the regularization parameter C, which defaults to 1.0 in `svm.SVC()`, so values of exactly -1 or 1 typically belong to support vectors that lie inside the margin or are misclassified.
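
The sign and size of the entries in `dual_coef_` can be checked on a toy problem. This is a minimal sketch, assuming a small, linearly separable, made-up dataset with the same two labels (0 and 11). A linear kernel is used only to keep the example simple, and the names `X_toy`, `y_toy` and `toy_model` are hypothetical.

```python
import numpy as np
from sklearn import svm

# Toy, linearly separable data (hypothetical; not the notebook's TF-IDF matrix).
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y_toy = np.array([0, 0, 11, 11])

toy_model = svm.SVC(kernel="linear", C=1.0)
toy_model.fit(X_toy, y_toy)

# dual_coef_ has one entry per support vector; support_ gives their row indices.
coefs = toy_model.dual_coef_[0]
sv_labels = y_toy[toy_model.support_]

for label, coef in zip(sv_labels, coefs):
    print(f"support vector of class {label}: dual coefficient {coef:+.3f}")

# In the binary case the sign encodes the class of the support vector:
# negative for classes_[0], positive for classes_[1].
expected_signs = np.where(sv_labels == toy_model.classes_[1], 1.0, -1.0)
signs_match_classes = bool(np.all(np.sign(coefs) == expected_signs))

# The coefficients sum to zero and are bounded in magnitude by C.
sums_to_zero = bool(np.isclose(coefs.sum(), 0.0))
bounded_by_c = bool(np.all(np.abs(coefs) <= toy_model.C + 1e-9))
print(signs_match_classes, sums_to_zero, bounded_by_c)
```

The same pattern is visible in the notebook's output, where the leading coefficients are negative and the trailing ones are positive.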

In conclusion, your SVM model seems to perform reasonably well with an average accuracy of 82% in cross-validation. The model is, however, more accurate for class 11 than for class 0. It might be beneficial to explore ways to improve the model's performance for class 0, perhaps by adjusting class weights, trying different kernels, or using more advanced techniques for dealing with imbalanced classes.
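
As an illustration of the class-weight suggestion, the sketch below computes the weights that scikit-learn's `"balanced"` setting would assign, using the training-set class counts reported in this notebook (1685 questions of class 0 and 5152 of class 11). It is an untested idea for this dataset rather than a confirmed improvement, and the variable names are hypothetical.

```python
import numpy as np
from sklearn import svm
from sklearn.utils.class_weight import compute_class_weight

# Training-set class counts taken from the training classification report.
classes = np.array([0, 11])
y_counts = np.repeat(classes, [1685, 5152])

# "balanced" weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_counts)
for cls, w in zip(classes, weights):
    print(f"class {cls}: weight {w:.3f}")

# An SVC that applies the same weighting internally during fit.
# It would be trained with weighted_svm.fit(X_train, Y_train).
weighted_svm = svm.SVC(class_weight="balanced")
```

With these counts, errors on class 0 would be penalized roughly three times as heavily as errors on class 11, which usually trades some class 11 recall for better class 0 recall.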

In [11]:
from sklearn.metrics import accuracy_score, classification_report

# Make predictions on the training set
Y_train_pred_svm = svm_model.predict(X_train)

# Print a classification report for the training set
print(classification_report(Y_train, Y_train_pred_svm))

# Compute the accuracy on the training set
train_accuracy_svm = accuracy_score(Y_train, Y_train_pred_svm)
print("Training Accuracy for SVM: ", train_accuracy_svm)

              precision    recall  f1-score   support

           0       0.99      0.95      0.97      1685
          11       0.98      1.00      0.99      5152

    accuracy                           0.98      6837
   macro avg       0.99      0.97      0.98      6837
weighted avg       0.98      0.98      0.98      6837

Training Accuracy for SVM:  0.9840573350884891


***GPT4:*** 

Based on the provided results:

In the classification report for the test data, the SVM model achieves an accuracy of 0.87. When we look at the precision, recall, and F1-score metrics for the individual classes (0 and 11), we see that the model performs better on class 11 than on class 0. This suggests that the model might have difficulties distinguishing class 0 instances.

In the classification report for the training data, the SVM model achieves an accuracy of 0.98, which is higher than that of the test data. Again, the model performs slightly better on class 11 than on class 0, but the difference is less pronounced than in the test data.

The accuracy on the training set (0.98) is higher than the accuracy on the test set (0.87). This difference indicates some degree of overfitting, as the model performs better on the data it was trained on than on new data. The gap in overall accuracy is moderate (about 11 percentage points), so it might not be a major concern, but it is most pronounced for class 0, whose recall falls from 0.95 on the training set to 0.61 on the test set.

In summary, the SVM model performs fairly well, but it might have difficulties distinguishing instances of class 0. There might be some degree of overfitting, but it's not clear whether this is a major issue based on the provided results alone. Other diagnostic tools, such as learning curves or more detailed error analysis, could provide more insights into the model's performance and potential issues.
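
A learning curve of the kind mentioned above could be produced along the following lines. This is a minimal sketch, assuming synthetic stand-in data from `make_classification` so that it runs on its own; in the notebook, `X_demo` and `y_demo` would be replaced by `X` and `Y`, and the run would take considerably longer.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (assumption); in the notebook this would be X and Y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, weights=[0.25, 0.75], random_state=0
)

train_sizes, train_scores, val_scores = learning_curve(
    svm.SVC(),
    X_demo,
    y_demo,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Average over the 5 folds to get one training and one validation score per size.
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, mean_train, mean_val):
    print(f"{int(n):4d} training rows: train accuracy={tr:.2f}, validation accuracy={va:.2f}")
```

A training score that stays well above the validation score as the training set grows would support the overfitting reading; two curves that converge would point the other way.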