# Project Overview:
The objective of this project is to analyze the sentiment of movie reviews in three different languages - English, French, and Spanish. We have been given 30 movies, 10 in each language, along with their reviews and synopses in separate CSV files named movie_reviews_eng.csv, movie_reviews_fr.csv, and movie_reviews_sp.csv.

* The first step of this project is to read the data from all the .csv files and create a single pandas dataframe. This dataframe should have the following columns: Title, Year, Synopsis, Review, and Original Language.
* The next step is to convert the French and Spanish reviews and synopses into English. This will allow us to analyze the sentiment of all reviews in the same language. We will be using pre-trained transformers from HuggingFace to achieve this task.
* Finally, we will use pretrained transformers from HuggingFace to analyze the sentiment of each review. The sentiment analysis results (Positive or Negative) will be added to the dataframe in a new column called Sentiment.

The output of the project will be a CSV file with a header row containing the column names Title, Year, Synopsis, Review, Sentiment, and Original Language. The Original Language column records the language of the review and synopsis before translation (eng/fr/sp, the codes used in this notebook). The dataframe will consist of 30 rows, one per movie.
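
As a small sanity check on the output described above, the sketch below validates a dataframe against the expected columns and row count. The helper name `check_output_schema` and its `expected_rows` parameter are illustrative assumptions and do not appear in the original notebook.

```python
import pandas as pd

# Column names taken from the project description above.
EXPECTED_COLUMNS = ["Title", "Year", "Synopsis", "Review", "Sentiment", "Original Language"]


def check_output_schema(df: pd.DataFrame, expected_rows: int = 30) -> list:
    """Return a list of problems found; an empty list means the frame looks right.

    Hypothetical helper, for illustration only.
    """
    problems = []
    # Report any required column that is absent from the dataframe.
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    # Report a row count that differs from the expected number of movies.
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    return problems
```

Calling `check_output_schema(df)` just before writing the CSV would surface a missing Sentiment column or an unexpected row count early. The check only tests that the columns are present, not their order.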

### Tools used:
* Pandas: for data manipulation and analysis
* HuggingFace Transformers: for natural language processing tasks, such as translation and sentiment analysis
* PyTorch: the backend that runs the pretrained Transformers models

### Skills Mastered:
* Data cleaning and manipulation using Pandas
* Natural language processing techniques, such as translation and sentiment analysis, using HuggingFace Transformers
* Running pretrained machine learning models with PyTorch
* Integration of multiple tools and libraries to solve a complex problem

Overall, this project combines **data manipulation**, **natural language processing**, and **machine learning techniques** to perform multiple tasks on a dataset.

In [1]:
# install dependencies (notebook shell commands)
!pip install transformers
!pip install sentencepiece

# imports
import pandas as pd
from transformers import MarianMTModel, MarianTokenizer, pipeline



In [2]:
def preprocess_data(file_path, name1, name2, name3):
    """
    Read movie data from three CSV files, map their column names to a common
    schema, add an "Original Language" column, and concatenate the results.

    Returns the combined dataframe followed by the three per-language dataframes.
    """
    # Build the file paths
    dir1 = f"{file_path}{name1}.csv"
    dir2 = f"{file_path}{name2}.csv"
    dir3 = f"{file_path}{name3}.csv"
    
    # Read CSV files
    movie_reviews_name1 = pd.read_csv(dir1)
    movie_reviews_name2 = pd.read_csv(dir2)
    movie_reviews_name3 = pd.read_csv(dir3)
    
    # get columns names
    list1=movie_reviews_name1.columns
    list2=movie_reviews_name2.columns
    list3=movie_reviews_name3.columns

    # Rename the first four columns to the required names

    movie_reviews_name1= movie_reviews_name1.rename(columns={list1[0]:'Title',
                                                             list1[1]:'Year',
                                                             list1[2]:'Synopsis',
                                                             list1[3]:'Review'})

    movie_reviews_name2 = movie_reviews_name2.rename(columns={list2[0]: 'Title',
                                                                list2[1]: 'Year',
                                                                list2[2]: 'Synopsis',
                                                                list2[3]: 'Review'})
    movie_reviews_name3=movie_reviews_name3.rename(columns={list3[0]: 'Title',
                                                            list3[1]: 'Year',
                                                            list3[2]: 'Synopsis',
                                                            list3[3]: 'Review'})

    # Add a column to each dataframe indicating the original language
    movie_reviews_name1["Original Language"] = f"{name1}"
    movie_reviews_name2["Original Language"] = f"{name2}"
    movie_reviews_name3["Original Language"] = f"{name3}"

    # Combine dataframes
    movie_reviews = pd.concat([movie_reviews_name1, movie_reviews_name2, movie_reviews_name3], ignore_index=True)
    
    return movie_reviews,movie_reviews_name1,movie_reviews_name2,movie_reviews_name3
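
The loading step above repeats the same read, rename, and tag sequence once per file. A more compact variant is sketched below, assuming (as the function above does) that the first four columns of every file hold title, year, synopsis, and review in that order. The name `load_reviews` and the `COLUMNS` constant are hypothetical, and this version returns only the combined dataframe.

```python
import pandas as pd

# Target names for the first four columns of every input file.
COLUMNS = ["Title", "Year", "Synopsis", "Review"]


def load_reviews(file_prefix: str, languages: list) -> pd.DataFrame:
    """Load one CSV per language code and stack them into a single dataframe."""
    frames = []
    for lang in languages:
        frame = pd.read_csv(f"{file_prefix}{lang}.csv")
        # Rename by position so differently named headers map to one schema.
        frame = frame.rename(columns=dict(zip(frame.columns[:4], COLUMNS)))
        # Tag every row with the language code it was loaded under.
        frame["Original Language"] = lang
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

With the paths used in this notebook, the call would be `load_reviews("/content/movie_reviews_", ["eng", "fr", "sp"])`. Looping over a list of codes means adding a fourth language does not require new parameters.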


In [3]:
# common prefix of the CSV file paths
file_path = "/content/movie_reviews_"

df = preprocess_data(file_path, "eng", "fr", "sp")[0]
print(len(df))
df.sample(10)

30


| | Title | Year | Synopsis | Review | Original Language |
|---|---|---|---|---|---|
| 16 | La Tour Montparnasse Infernale | 2001 | Deux employés de bureau incompétents se retrou... | "Je ne peux pas croire que j'ai perdu du temps... | fr |
| 1 | The Dark Knight | 2008 | Batman (Christian Bale) teams up with District... | "The Dark Knight is a thrilling and intense su... | eng |
| 7 | The Nice Guys | 2016 | In 1970s Los Angeles, a private eye (Ryan Gosl... | "The Nice Guys tries too hard to be funny, and... | eng |
| 3 | The Godfather | 1972 | Don Vito Corleone (Marlon Brando) is the head ... | "The Godfather is a classic movie that stands ... | eng |
| 5 | Blade Runner 2049 | 2017 | Officer K (Ryan Gosling), a new blade runner f... | "Boring and too long. Nothing like the origina... | eng |
| 28 | Torrente: El brazo tonto de la ley | 1998 | En esta comedia española, un policía corrupto ... | "Torrente es una película vulgar y ofensiva qu... | sp |
| 14 | Le Fabuleux Destin d'Amélie Poulain | 2001 | Cette comédie romantique raconte l'histoire d'... | "Le Fabuleux Destin d'Amélie Poulain est un fi... | fr |
| 23 | El Laberinto del Fauno | 2006 | Durante la posguerra española, Ofelia (Ivana B... | "El Laberinto del Fauno es una película fascin... | sp |
| 26 | Toc Toc | 2017 | En esta comedia española, un grupo de personas... | "Toc Toc es una película aburrida y poco origi... | sp |
| 18 | Les Visiteurs en Amérique | 2000 | Dans cette suite de la comédie française Les V... | "Le film est une perte de temps totale. Les bl... | fr |


### Text translation

Translate the **Review** and **Synopsis** column values to English.

In [4]:
# Create a function to translate text
def translate(text, model, tokenizer):
    # Tokenize by calling the tokenizer directly (prepare_seq2seq_batch is deprecated)
    inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True)
    translated = model.generate(**inputs)
    return tokenizer.decode(translated[0], skip_special_tokens=True)

# Initialize models and tokenizers for French and Spanish translation
tokenizer_fr = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-fr-en")
model_fr = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-fr-en")
tokenizer_sp = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")
model_sp = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-es-en")

# Translate French and Spanish reviews and synopses
for index, row in df.iterrows():
    if row["Original Language"] == "fr":
        df.at[index, "Synopsis"] = translate(row["Synopsis"], model_fr, tokenizer_fr)
        df.at[index, "Review"] = translate(row["Review"], model_fr, tokenizer_fr)
    elif row["Original Language"] == "sp":
        df.at[index, "Synopsis"] = translate(row["Synopsis"], model_sp, tokenizer_sp)
        df.at[index, "Review"] = translate(row["Review"], model_sp, tokenizer_sp)






In [5]:
df.sample(10)

| | Title | Year | Synopsis | Review | Original Language |
|---|---|---|---|---|---|
| 29 | El Incidente | 2014 | In this Mexican horror film, a group of people... | "The Incident is a boring and frightless film ... | sp |
| 2 | Forrest Gump | 1994 | Forrest Gump (Tom Hanks) is a simple man with ... | "Forrest Gump is a heartwarming and inspiratio... | eng |
| 28 | Torrente: El brazo tonto de la ley | 1998 | In this Spanish comedy, a corrupt cop (played ... | "Torrente is a vulgar and offensive film that ... | sp |
| 26 | Toc Toc | 2017 | In this Spanish comedy, a group of people with... | "Toc Toc is a boring and unoriginal film that ... | sp |
| 11 | Intouchables | 2011 | This film tells the story of the unlikely frie... | "Untouchables is an incredibly touching film w... | fr |
| 4 | Inception | 2010 | Dom Cobb (Leonardo DiCaprio) is a skilled thie... | "Inception is a mind-bending and visually stun... | eng |
| 27 | El Bar | 2017 | A group of people are trapped in a bar after M... | "The Bar is a ridiculous and meaningless film ... | sp |
| 7 | The Nice Guys | 2016 | In 1970s Los Angeles, a private eye (Ryan Gosl... | "The Nice Guys tries too hard to be funny, and... | eng |
| 9 | The Island | 2005 | In a future where people are cloned for organ ... | "The Island is a bland and forgettable sci-fi ... | eng |
| 8 | Solo: A Star Wars Story | 2018 | A young Han Solo (Alden Ehrenreich) joins a gr... | "Dull and pointless, with none of the magic of... | eng |
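
The row-by-row translation loop above can also be written with a boolean mask, which keeps the language check in one place. The sketch below is one possible shape for that. `translate_rows` is a hypothetical helper, and `translate_fn` stands for any function that maps one string to its translation, so the masking logic can be tried without downloading a model.

```python
from typing import Callable

import pandas as pd


def translate_rows(df: pd.DataFrame, language: str,
                   translate_fn: Callable[[str], str],
                   columns=("Synopsis", "Review")) -> pd.DataFrame:
    """Return a copy of df with `columns` translated for rows in `language`."""
    result = df.copy()
    # Select only the rows whose original language matches.
    mask = result["Original Language"] == language
    for column in columns:
        # Apply the translation function to the selected cells only.
        result.loc[mask, column] = result.loc[mask, column].map(translate_fn)
    return result
```

In this notebook the French rows could then be handled with `df = translate_rows(df, "fr", lambda text: translate(text, model_fr, tokenizer_fr))`, and the Spanish rows likewise with `model_sp` and `tokenizer_sp`.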


### Sentiment Analysis
Use a pretrained HuggingFace model to classify the sentiment of each review, and store the result (Positive or Negative) in a new dataframe column titled Sentiment. The pipeline used here reports its labels in upper case (POSITIVE/NEGATIVE).

In [11]:
# Initialize the sentiment analysis pipeline, naming the model explicitly.
# distilbert-base-uncased-finetuned-sst-2-english (revision af0f99b) is the
# checkpoint the pipeline defaulted to in the original run when none was given.
sentiment_analysis = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Analyze sentiment of each review and add the result to the dataframe
df["Sentiment"] = df["Review"].apply(lambda x: sentiment_analysis(x)[0]["label"])

# Save the final dataframe as a CSV file
df.to_csv("/content/df_sentiment.csv", index=False)
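
The project description asks for the values Positive or Negative, while the pipeline emits POSITIVE and NEGATIVE. A minimal sketch of a wrapper that classifies all reviews in one call and title-cases the labels follows. `label_reviews` is a hypothetical name, and the title-casing is an assumption about the desired output format. The `classifier` argument is expected to behave like a HuggingFace text-classification pipeline, which returns one dict with a `"label"` key per input string when given a list.

```python
def label_reviews(reviews, classifier):
    """Classify all reviews in one call and return title-cased labels."""
    # Passing a list lets the classifier process every review in one call.
    results = classifier(list(reviews))
    # "POSITIVE" -> "Positive", "NEGATIVE" -> "Negative"
    return [result["label"].capitalize() for result in results]
```

With the objects defined in this notebook, the column could be filled with `df["Sentiment"] = label_reviews(df["Review"], sentiment_analysis)`.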

In [12]:
df.sample(10)

| | Title | Year | Synopsis | Review | Original Language | Sentiment |
|---|---|---|---|---|---|---|
| 4 | Inception | 2010 | Dom Cobb (Leonardo DiCaprio) is a skilled thie... | "Inception is a mind-bending and visually stun... | eng | POSITIVE |
| 29 | El Incidente | 2014 | In this Mexican horror film, a group of people... | "The Incident is a boring and frightless film ... | sp | NEGATIVE |
| 25 | Águila Roja | (2009-2016) | This Spanish television series follows the adv... | "Red Eagle is a boring and uninteresting serie... | sp | NEGATIVE |
| 12 | Amélie | 2001 | This romantic comedy tells the story of Amélie... | "Amélie is an absolutely charming film that wi... | fr | POSITIVE |
| 26 | Toc Toc | 2017 | In this Spanish comedy, a group of people with... | "Toc Toc is a boring and unoriginal film that ... | sp | NEGATIVE |
| 16 | La Tour Montparnasse Infernale | 2001 | Two incompetent office workers find themselves... | "I can't believe I've wasted time watching thi... | fr | NEGATIVE |
| 24 | Amores perros | 2000 | Three stories intertwine in this Mexican film:... | "Amores dogs is an intense and moving film tha... | sp | POSITIVE |
| 11 | Intouchables | 2011 | This film tells the story of the unlikely frie... | "Untouchables is an incredibly touching film w... | fr | POSITIVE |
| 20 | Roma | 2018 | Cleo (Yalitza Aparicio) is a young domestic wo... | "Rome is a beautiful and moving film that pays... | sp | POSITIVE |
| 13 | Les Choristes | 2004 | This film tells the story of a music teacher w... | "The Choristes are a beautiful film that will ... | fr | POSITIVE |
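
One way to summarize the Sentiment column is to cross-tabulate it against Original Language. The sketch below assumes a dataframe with those two columns, as produced above. The helper name `sentiment_counts` is hypothetical.

```python
import pandas as pd


def sentiment_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count reviews per (original language, sentiment) pair."""
    # Rows are language codes, columns are sentiment labels, cells are counts.
    return pd.crosstab(df["Original Language"], df["Sentiment"])
```

Pairs that never occur appear as zero in the result, so a language with only positive reviews still gets a complete row.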
