# Task 1: Sentiment Analysis with Logistic Regression

## Overview
In this task, you will build a sentiment classification system to label movie reviews as either positive or negative. You are free to choose any model or feature extraction technique, but your implementation must use a scikit-learn pipeline.
Your solution will be tested on a hidden test set and must meet performance requirements to be considered valid.

In [1]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

In [2]:
# load data
container_path = "data/movie_reviews"
data = load_files(container_path)

In [3]:
# separate the documents from their labels
X, y = data.data, data.target

# split the dataset into train and test sets
# random_state keeps the split identical across runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [4]:
# create a pipeline for logistic regression;
# name each step so it can be accessed quickly later on
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        lowercase=True,
        stop_words='english',
        max_features=1000,
        ngram_range=(1, 2)
    )),
    ('clf', LogisticRegression(max_iter=1000))
])
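
The step names given to the pipeline are what make the "quick access later on" possible. The sketch below rebuilds the same pipeline so that it runs on its own, then looks up a step by name and changes a hyperparameter through scikit-learn's `<step>__<param>` naming scheme. The value `5000` is an arbitrary illustration, not a tuned setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Same structure as the pipeline above, rebuilt so the snippet is self-contained.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english",
                              max_features=1000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Look up a step by its label.
vectorizer = pipeline.named_steps["tfidf"]
classifier = pipeline["clf"]  # indexing the pipeline by step name also works

# Change a hyperparameter of a named step in place.
# 5000 is an illustrative value only.
pipeline.set_params(tfidf__max_features=5000)
print(vectorizer.max_features)
```

Because `set_params` updates the step object in place, the `vectorizer` reference obtained earlier reflects the new setting.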

In [5]:
# train the model
pipeline.fit(X_train, y_train)

# get prediction for test set
y_pred = pipeline.predict(X_test)

In [6]:
# display the report
print(classification_report(y_test, y_pred, target_names=data.target_names))

              precision    recall  f1-score   support

         neg       0.78      0.78      0.78       235
         pos       0.79      0.79      0.79       245

    accuracy                           0.79       480
   macro avg       0.79      0.79      0.79       480
weighted avg       0.79      0.79      0.79       480
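
For reference, the per-class figures in a report like this one follow from simple counts of true positives, false positives, and false negatives. The following is a minimal pure-Python sketch of that arithmetic. The helper names `per_class_scores` and `accuracy` are hypothetical, and the labels are made-up toy values, not the review data.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) with `label` treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels for illustration only (0 = neg, 1 = pos).
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 0, 0]

neg_scores = per_class_scores(y_true_toy, y_pred_toy, label=0)
pos_scores = per_class_scores(y_true_toy, y_pred_toy, label=1)

# "macro avg" in the report is the unweighted mean of the per-class values.
macro_f1 = (neg_scores[2] + pos_scores[2]) / 2
print(neg_scores, pos_scores, macro_f1, accuracy(y_true_toy, y_pred_toy))
```

The "weighted avg" row uses the same per-class values but weights each class by its support.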



## Note:

1. The actual task was to train the model on the whole dataset and return the pipeline. The pipeline was then tested on hidden data to check whether it achieves accuracy >= 80% and F1-score >= 80%.
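
The requirement described in the note could be sketched as follows. The function names `train_full_pipeline` and `meets_requirements` are hypothetical, the corpus is a tiny made-up stand-in so the snippet runs without the review files, and the use of macro-averaged F1 is an assumption, since the note does not say which average the hidden test applies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import Pipeline


def train_full_pipeline(texts, labels):
    """Fit the TF-IDF + logistic regression pipeline on all available data."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english",
                                  max_features=1000, ngram_range=(1, 2))),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    return pipeline.fit(texts, labels)  # fit() returns the pipeline itself


def meets_requirements(pipeline, texts, labels, threshold=0.80):
    """Check accuracy and F1 against the threshold.

    Assumption: macro-averaged F1; the note does not specify the average.
    """
    predictions = pipeline.predict(texts)
    acc = accuracy_score(labels, predictions)
    macro_f1 = f1_score(labels, predictions, average="macro")
    return bool(acc >= threshold and macro_f1 >= threshold)


# Tiny made-up corpus (1 = pos, 0 = neg), for illustration only.
toy_texts = [
    "a wonderful, moving film",
    "brilliant acting and a great script",
    "dull, boring and far too long",
    "a terrible waste of time",
]
toy_labels = [1, 1, 0, 0]

fitted = train_full_pipeline(toy_texts, toy_labels)

# Scoring on the training texts only demonstrates the call;
# a meaningful check needs data the model has not seen.
print(meets_requirements(fitted, toy_texts, toy_labels))
```

With the real data, `train_full_pipeline(data.data, data.target)` would use every review for training, so any local estimate of the threshold has to come from a separate split or from cross-validation beforehand.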