**Project: Using NLP for Requirements Engineering for Data Management Software**

Source data: Stack Overflow questions and answers concerning Azure Data Factory

Tags: azure-data-factory, adf 

Title keywords: azure data factory, adf

Link to data source: https://data.stackexchange.com/stackoverflow/query/new

Link to generated schema: https://dbdiagram.io/d/6448f5b26b3194705139098b

Link to GitHub Repo: https://github.com/tanwolf/NLP_Requirements-Engineering


**Purposes of this notebook:** 
- Train a logistic regression model to identify questions belonging to topic 0, "Data Management in Pipelines". The topic was identified through Latent Dirichlet Allocation (LDA).

In [3]:
import pandas as pd

df = pd.read_csv('merged_adf_df_with_lda_topic1.csv', index_col=0)

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Initialize a CountVectorizer (bag-of-words token counts)
vectorizer = CountVectorizer()

# Fit the vectorizer and transform 'LemmatizedQuestionBody' into a feature matrix X
X = vectorizer.fit_transform(df['LemmatizedQuestionBody'])

# The target is the LDA topic label assigned to each question
y = df['LDATopic']

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train a logistic regression model
logreg = LogisticRegression(solver='lbfgs', max_iter=1000)
logreg.fit(X_train, y_train)

# Make predictions on the test set and print a classification report
y_pred = logreg.predict(X_test)
report = classification_report(y_test, y_pred)
print("Classification Report:")
print(report)


Classification Report:
              precision    recall  f1-score   support

           0       0.76      0.71      0.73       412
          11       0.91      0.93      0.92      1298

    accuracy                           0.87      1710
   macro avg       0.83      0.82      0.82      1710
weighted avg       0.87      0.87      0.87      1710



In [5]:
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

# Reuses y_test and y_pred from the previous cell

# Print confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Confusion Matrix:
[[ 291  121]
 [  93 1205]]
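
The per-class figures in the classification report can be recomputed from this confusion matrix. The sketch below hard-codes the four counts printed above and assumes scikit-learn's layout (rows are true classes, columns are predicted classes, both in the order 0, 11); the variable names are illustrative.

```python
import numpy as np

# Counts copied from the confusion matrix output above.
# Rows: true class (0, 11). Columns: predicted class (0, 11).
cm = np.array([[291, 121],
               [93, 1205]])

class_labels = [0, 11]
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.trace(cm) / cm.sum()

for label, p, r, f in zip(class_labels, precision, recall, f1):
    print(f"class {label:>2}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The row sums (412 and 1298) match the support column of the classification report, which is a quick check that the rows are indeed the true classes.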


In [6]:
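# Feature importance: in the binary case, coef_[0] scores logreg.classes_[1] (class 11),
# so negative coefficients push toward class 0 and positive ones toward class 11.
feature_names = vectorizer.get_feature_names_out()
coefs_with_fns = sorted(zip(logreg.coef_[0], feature_names))
# Pair the 10 most negative coefficients (left) with the 10 most positive ones (right)
top = zip(coefs_with_fns[:10], coefs_with_fns[:-(10 + 1):-1])
for (coef_1, fn_1), (coef_2, fn_2) in top:
    print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))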

	-1.8046	dataflows      		1.9717	doesnt         
	-1.7123	dataflow       		1.5091	myexpression   
	-1.5817	following      		1.5038	datafactoryname
	-1.3450	guidance       		1.4720	mydate         
	-1.3017	annotation     		1.2337	complete       
	-1.2582	compression    		1.1317	someid         
	-1.2522	flatten        		1.1246	md             
	-1.2506	regex          		1.1213	connect        
	-1.2346	compare        		1.0887	application    
	-1.1564	fed            		1.0486	sap            


In [7]:
# Cross-validation
scores = cross_val_score(logreg, X, y, cv=5)
print("Cross-validation scores: ", scores)
print("Average cross-validation score: ", scores.mean())

Cross-validation scores:  [0.82807018 0.81461988 0.83382095 0.76711527 0.812756  ]
Average cross-validation score:  0.8112764552301371
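
The cell above passes the already-vectorized `X` to `cross_val_score`, so the vocabulary was learned from all rows, including each fold's held-out part. One way to avoid this is to put the vectorizer and the classifier in a single pipeline, so that every fold fits its own vocabulary. The following is a minimal sketch on a made-up corpus, not a rerun of the notebook; `texts`, `labels`, and the fold count of 3 are assumptions chosen to keep the example self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for df['LemmatizedQuestionBody'] and df['LDATopic'].
texts = ["dataflow flatten column mapping"] * 6 + ["connect sap application error"] * 6
labels = [0] * 6 + [11] * 6

# The vectorizer is refit inside every training fold.
pipeline = make_pipeline(
    CountVectorizer(),
    LogisticRegression(solver="lbfgs", max_iter=1000),
)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv)
print("Cross-validation scores:", scores)
print("Average cross-validation score:", scores.mean())
```

With the notebook's data, the raw text column and `y` would take the place of `texts` and `labels`; the resulting scores may differ slightly from those printed above.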


In [8]:
from sklearn.metrics import accuracy_score

# Make predictions on the training set
y_train_pred = logreg.predict(X_train)

# Print a classification report for the training set
print(classification_report(y_train, y_train_pred))

# Compute the accuracy on the training set
train_accuracy = accuracy_score(y_train, y_train_pred)
print("Training Accuracy: ", train_accuracy)

              precision    recall  f1-score   support

           0       0.99      0.97      0.98      1685
          11       0.99      1.00      0.99      5152

    accuracy                           0.99      6837
   macro avg       0.99      0.98      0.99      6837
weighted avg       0.99      0.99      0.99      6837

Training Accuracy:  0.9903466432645898


***GPT4 prompt:*** Summarize and analyze the results:

***GPT4 result:*** The Logistic Regression model is used to classify the questions into two categories, class 0 and class 11.

- **Classification Report:** The model reaches an overall accuracy of 87% on the test set. The precision, recall, and f1-score for class 0 are 76%, 71%, and 73%, respectively. For class 11, these metrics are 91%, 93%, and 92%. The model therefore performs better on class 11 than on class 0, which is similar to the results obtained from the previous SVM model.

- **Confusion Matrix:** The confusion matrix provides similar insights. The model correctly identified 291 instances of class 0 but misclassified 121 as class 11. Similarly, it correctly identified 1205 instances of class 11 but misclassified 93 as class 0. This confirms that the model performs better on class 11 than on class 0.

- **Feature Importance:** The model's coefficients indicate how strongly each word pushes a prediction toward one class. The words with the most negative coefficients ("dataflows", "dataflow", "following", and others) are the most influential for class 0, whereas the words with the most positive coefficients ("doesnt", "myexpression", "datafactoryname", and others) are the most influential for class 11.

- **Cross-validation scores:** The cross-validation scores are fairly consistent, ranging from about 0.77 to 0.83, with an average of 0.81. This suggests that the model's performance is not overly dependent on the particular division of the data into training and test sets.

In conclusion, the Logistic Regression model performs well with an average accuracy of 81% in cross-validation. As with the previous SVM model, this model performs better on class 11 than on class 0. The feature importance analysis gives an insight into which words are most influential in classifying a question into each class. To improve the model's performance, especially for class 0, it may be beneficial to look into feature engineering or fine-tuning the model's hyperparameters.
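
The hyperparameter tuning suggested above could start with the regularization strength `C`, since the training accuracy (0.99) is well above the test accuracy (0.87), which hints at overfitting. The sketch below is one possible setup under stated assumptions, not a tuned result: the corpus is made up, and the grid values, the 3 folds, and the macro-averaged F1 scoring (chosen so that class 0 counts as much as class 11) are illustrative choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the notebook's text column and topic labels.
texts = ["dataflow flatten column mapping"] * 6 + ["connect sap application error"] * 6
labels = [0] * 6 + [11] * 6

pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", LogisticRegression(solver="lbfgs", max_iter=1000)),
])

# Smaller C means stronger L2 regularization.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1_macro")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best macro F1:", search.best_score_)
```

On the real data, the selected `C` should still be confirmed on the held-out test set, which plays no part in the grid search.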