**PERFORM TOPIC LABELING AND SENTIMENT ANALYSIS WITH A GENERATIVE LLM**

<p style="text-align:justify; ; line-height: 1.5">This notebook aims to demonstrate how to leverage a Large Language Model (LLM) to perform various tasks such as review labeling and sentiment extraction. We will be using the LangChain framework, which has been specifically developed to facilitate interaction with LLMs.</p>

#### 0. Install packages

In [None]:
%pip install langchain==0.0.200 --quiet
%pip install pandas==1.5.3 --quiet


#### 1. Package loading

In [None]:
import pandas as pd
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.output_parsers import ResponseSchema, StructuredOutputParser

In [None]:
import os

# Read the API key from the environment instead of hardcoding a secret in the notebook
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

#### 2. Building the prompt

##### 2.1. Prompt for topics description and clustering

<p style="text-align:justify; ; line-height: 1.5">Topic modeling with latent Dirichlet allocation (LDA) extracted 10 topics from our dataset. Each topic is represented by its characteristic equation, a weighted sum of its most representative word stems. Using the prompts defined in this section, we first generate a description of each topic from its equation, then group the 10 topics into 3 main clusters.
Note: LangChain is not needed at this stage; ChatGPT is sufficient. Copy and paste the equations and the prompts into ChatGPT.</p>
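To prepare the text to paste into ChatGPT, the topic equations and a prompt can be assembled programmatically. The following is a minimal sketch under stated assumptions: `tidy_prompt` and `build_chatgpt_message` are hypothetical helpers (they are not part of LangChain), and the two-topic list and shortened prompt are excerpts used for illustration only.

```python
import re


def tidy_prompt(prompt: str) -> str:
    """Collapse runs of spaces/tabs on each line and trim surrounding blank lines."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in prompt.splitlines()]
    return "\n".join(lines).strip()


def build_chatgpt_message(topics, prompt: str) -> str:
    """Render (topic_id, equation) pairs followed by the prompt as one block of text."""
    topic_lines = [f"Topic {topic_id}: {equation}" for topic_id, equation in topics]
    return "\n".join(topic_lines) + "\n\n" + tidy_prompt(prompt)


# Shortened excerpt of the LDA output (illustration only)
example_topics = [
    (5, '0.144*"rapid" + 0.083*"efficac" + 0.057*"facil"'),
    (9, '0.066*"prix" + 0.037*"tarif" + 0.029*"qualit"'),
]

# Shortened prompt written in the same indented, backslash-continued style
example_prompt = """
    Each item in this list represents a topic.\
    Generate a clear and concise description of each topic.
    """

message = build_chatgpt_message(example_topics, example_prompt)
print(message)
```

Cleaning the whitespace matters here because a backslash at the end of a line inside a triple-quoted string removes the newline but keeps the indentation of the next line, which would otherwise leave long runs of spaces in the pasted text.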

In [None]:
[(0,
  '0.036*"conseil" + 0.031*"souscript" + 0.024*"téléphon" + 0.021*"expliqu" + 0.021*"contrat" + 0.020*"question" + 0.019*"inform" + 0.016*"temp" + 0.016*"clair" + 0.014*"person"'),
 (1,
  '0.054*"sérieux" + 0.031*"projet" + 0.024*"entretien" + 0.022*"technicien" + 0.021*"téléphon" + 0.019*"travail" + 0.017*"entrepris" + 0.016*"demand" + 0.016*"interven" + 0.015*"mis"'),
 (2,
  '0.021*"factur" + 0.018*"mois" + 0.016*"pai" + 0.015*"san" + 0.015*"totalenerg" + 0.014*"contrat" + 0.012*"toujour" + 0.011*"appel" + 0.010*"demand" + 0.010*"fair"'),
 (3,
  '0.017*"avis" + 0.015*"fair" + 0.014*"dan" + 0.014*"confianc" + 0.012*"dommag" + 0.010*"voir" + 0.010*"commentair" + 0.010*"cel" + 0.009*"attend" + 0.008*"mêm"'),
 (4,
  '0.049*"ras" + 0.046*"total" + 0.027*"énerg" + 0.027*"fournisseur" + 0.027*"energ" + 0.023*"électr" + 0.018*"compet" + 0.017*"chang" + 0.016*"abon" + 0.016*"déroul"'),
 (5,
  '0.144*"rapid" + 0.083*"efficac" + 0.057*"facil" + 0.051*"simpl" + 0.042*"satisfait" + 0.033*"souscript" + 0.033*"clair" + 0.022*"sit" + 0.021*"merc" + 0.014*"content"'),
 (6,
  '0.035*"solair" + 0.033*"panneau" + 0.031*"votr" + 0.027*"install" + 0.026*"satisf" + 0.024*"consomm" + 0.023*"excellent" + 0.022*"suiv" + 0.021*"redir" + 0.019*"instant"'),
 (7,
  '0.080*"professionnel" + 0.064*"conseil" + 0.057*"écout" + 0.055*"expliqu" + 0.038*"compétent" + 0.035*"clair" + 0.029*"tre" + 0.028*"agréabl" + 0.028*"merc" + 0.025*"accueil"'),
 (8,
  '0.061*"technicien" + 0.028*"intervent" + 0.024*"chaudi" + 0.016*"dan" + 0.015*"jour" + 0.014*"appel" + 0.012*"san" + 0.010*"chauffag" + 0.010*"rdv" + 0.009*"eau"'),
 (9,
  '0.066*"prix" + 0.037*"tarif" + 0.029*"qualit" + 0.027*"rapport" + 0.020*"impecc" + 0.020*"intéress" + 0.019*"offre" + 0.017*"pass" + 0.016*"propo"')]


In [None]:
topic_description_prompt =  """
                            Each item in this list represents a topic. These topics were obtained through the execution of topic\
                            modeling (latent Dirichlet allocation) on customer reviews collected from a website called Trustpilot.\
                            These reviews are related to energy suppliers. Using this input, generate a clear and concise description\
                            of each topic in a maximum of 2 sentences. Highlight the keywords that characterize each topic. Think step by\
                            step to provide a clear and precise response.
                            """

clustering_topic_prompt = """
                            Try to group these topics into 3 clusters. The topics within each group should be as similar as possible.\
                            You will then describe each cluster in one sentence while highlighting the top 5 words that best characterize\
                            each topic. The clusters should be as comprehensive as possible. You will also inform me of the topics you have\
                            assigned to a given cluster.

                            Think step by step to provide the most reliable and concise result.
                            Your reasoning steps are as follows:
                            Step 1: Perform a careful semantic analysis of the descriptions associated with each topic
                            Step 2: Using the results of your analysis in step 1, group the topics into 3 clusters
                            Step 3: Describe each cluster obtained in step 2.
                          """

##### 2.2. Prompt format for labeling

At this stage, ChatGPT has grouped the topics into clusters and described them. In this step, we use LangChain to label each review with one of these clusters.

In [None]:
prompt = """
         Your task is to assign the comment located between the <tag></tag> tags to one of the three topics defined as follows:

            Topic_1: Customer Service and Reviews: This cluster focuses on interactions with customer service, trust in the\
                     company, and the professional aspect of the service.
                     Characteristic keywords: "customer service," "reviews," "professional," "trust," "listening."

            Topic_2: Energy and Billing: This cluster addresses aspects related to energy, billing, prices, and the quality of\
                     energy services. Characteristic keywords: "energy," "billing," "prices," "rate," "quality."

            Topic_3: Projects and Techniques: This cluster deals with projects, efficiency, speed, and technical interventions\
                     related to renewable energy and heating.
                     Characteristic keywords: "projects," "efficiency," "solar," "heating," "technicians."

            For this comment, you will need to identify the words or expressions that best express the assigned topic and separate\
            them with hyphens. When necessary, lemmatize these words to reduce them to their canonical dictionary form.\
            For example, the words "work," "working," and "works" will be reduced to the canonical word "work."\
            Words within expressions should not be separated by hyphens.

            Additionally, you will need to determine the sentiment expressed by the customer in the comment. This sentiment can\
            be positive, negative, or neutral.

            In the output, you should provide the following three pieces of information:
            topic: the main subject expressed in the comment
            topic_word: the words that best describe this subject
            sentiment: the sentiment conveyed in the comment

            Here's an example:
            - comment: "All my products arrived defective. It's really disappointing and frustrating."
            - output:
              topic: Topic_1
              topic_word: defective-disappointing-frustrating
              sentiment: negative

            Produce the outputs for the following comment:
            <tag>{review}</tag>

            {format_instructions}
          """

##### 2.3. Definition of the LLM output parser

<p style="text-align:justify; ; line-height: 1.5">The output parser supplies formatting instructions that ask the LLM to structure its answer, then parses that answer into a Python dictionary from which we can extract the desired information: the topic, the words from the review that best express the topic, and the sentiment associated with the review.</p>
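To make the parsing step concrete, here is a simplified, standard-library stand-in for what a structured output parser does. It is not LangChain's implementation: `parse_structured_reply` is a hypothetical function, the reply is written by hand, and the sketch assumes the model answers with a JSON object inside a fenced `json` snippet, which is our understanding of what the format instructions request.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built programmatically so the example stays readable
EXPECTED_KEYS = ("topic", "topic_word", "sentiment")


def parse_structured_reply(reply: str) -> dict:
    """Extract the JSON object from a fenced json snippet and check its keys."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(\{.*?\})\s*" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no fenced JSON snippet found in the reply")
    parsed = json.loads(match.group(1))
    missing = [key for key in EXPECTED_KEYS if key not in parsed]
    if missing:
        raise ValueError(f"missing keys in the reply: {missing}")
    return parsed


# Hypothetical model reply, written by hand for illustration
reply = (
    "Here is the result:\n"
    + FENCE + "json\n"
    + '{"topic": "Topic_2", "topic_word": "tarif-facture", "sentiment": "negative"}\n'
    + FENCE
)
parsed_reply = parse_structured_reply(reply)
print(parsed_reply)
```

Raising an error when the snippet or a key is missing mirrors why the call to the parser is usually wrapped in a `try`/`except`: the model is not guaranteed to follow the requested format.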

In [None]:
response_schema = [ResponseSchema(name="topic",description="the main subject expressed in the comment"),
                   ResponseSchema(name="topic_word", description="the words that best describe this subject."),
                   ResponseSchema(name="sentiment", description="the sentiment conveyed in the comment")]

output_parser = StructuredOutputParser.from_response_schemas(response_schema)

In [None]:
format_instructions = output_parser.get_format_instructions()

##### 2.4. Initialization of the LLM and prompt template

In [None]:
# langchain==0.0.200 relies on the pre-1.0 openai client (it references openai.error)
%pip install "openai<1" --quiet

In [None]:
chat_model = ChatOpenAI(temperature=0.0, model='gpt-3.5-turbo', openai_api_key= OPENAI_API_KEY)
prompt_template = ChatPromptTemplate.from_template(template=prompt)

#### 3. Content generation

In [None]:

def get_response(customer_reviews:str)->dict:
    """
    Generate the topic, topic words, and sentiment of a review with an LLM

    Args:
        customer_reviews (str): the customer reviews

    Returns:
        dict: outputs extracted from the llm generation
    """

    # Initialization of the template
    messages = prompt_template.format_messages(review=customer_reviews,
                                               format_instructions=format_instructions)

    # get the completion from chat model
    response = chat_model(messages=messages)

    # parse the output
    try:
        response = output_parser.parse(response.content)
    except Exception as ex :
        print(f"{ex} : failed to parse llm output")
        response = {"topic":'null','topic_word':'null','sentiment':'null'}

    return response
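Because the model is free to return values outside the label sets named in the prompt, the parsed dictionary can be sanity-checked before it is stored. The sketch below is one possible check, not part of the original pipeline: `check_label` and the two `ALLOWED_*` sets are hypothetical names, and the input dictionary is written by hand.

```python
ALLOWED_TOPICS = {"Topic_1", "Topic_2", "Topic_3"}
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def check_label(result: dict) -> dict:
    """Validate one parsed answer and split its hyphen-separated topic words."""
    topic = str(result.get("topic", "")).strip()
    sentiment = str(result.get("sentiment", "")).strip().lower()
    raw_words = str(result.get("topic_word", ""))
    words = [word.strip() for word in raw_words.split("-") if word.strip()]
    return {
        # Unknown labels become None so they are easy to filter out later
        "topic": topic if topic in ALLOWED_TOPICS else None,
        "topic_words": words,
        "sentiment": sentiment if sentiment in ALLOWED_SENTIMENTS else None,
    }


# Hand-written example in the shape produced by the output parser
example = {
    "topic": "Topic_1",
    "topic_word": "defective-disappointing-frustrating",
    "sentiment": "Negative",
}
checked = check_label(example)
print(checked)
```

One limitation of this sketch: splitting on hyphens also breaks words that legitimately contain one (for example "rendez-vous"), so a different separator in the prompt may be preferable for French reviews.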

In [None]:
# Load the dataset that contains the customer verbatims
data = pd.read_parquet('/content/reviews_cleaned.parquet')

# Sampling
data = data.loc[:, ['clean_verb', 'note']].sample(n=10, random_state=7).reset_index(drop=True)
data = data.rename(columns={'clean_verb': 'verbatim'})

In [None]:
data.head()

|   | verbatim | note |
|---|----------|------|
| 0 | bon échange avec les conseillers | 5 |
| 1 | tarifs élevés mais centrale d'appels en france... | 3 |
| 2 | le service client ilek est exemplaire | 5 |
| 3 | mauvaise expérience sur la 1ère facture de rap... | 3 |
| 4 | je recommande ilek prise en charge rapide et e... | 5 |


In [None]:
# Apply get_response to each review, then unpack the returned dict into columns
data['output_dict'] = data['verbatim'].apply(get_response)
data['topic'] = data['output_dict'].apply(lambda d: d['topic'])
data['topic_word'] = data['output_dict'].apply(lambda d: d['topic_word'])
data['sentiment'] = data['output_dict'].apply(lambda d: d['sentiment'])

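AttributeError: module 'openai' has no attribute 'error'

Note: this error occurs when `openai` 1.0 or later is installed, because `langchain==0.0.200` still references the `openai.error` module, which was removed in that release. Installing a pre-1.0 client (`openai<1`) before running this cell avoids it.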

In [None]:
data.head()

|   | verbatim | note |
|---|----------|------|
| 0 | prix attractif conseiller sympathique | 5 |
| 1 | très aimable efficace impeccable merci beaucoup | 5 |
| 2 | j'avais souscrit sur leur site en ligne ma sou... | 3 |
| 3 | ilek n'est pas un fournisseur de gaz ou d'élec... | 5 |
| 4 | la souscription a été extrêmement compliquée j... | 1 |


In [None]:
data.loc[2,"verbatim"]

"j'avais souscrit sur leur site en ligne ma souscription avait bien été confirmé et on devait me contacter si besoin cependant 1 mois après ne recevant aucune nouvelle de mon contrat j'ai contacté le service client et j'ai du tout recommencer par téléphone pour confirmer ma souscription et la valider quelle perte de temps ! par conséquent à quoi sert la souscription sur votre site ? aucune relances par mail ou téléphone me demandant de les contacter !"

In [None]:
# Save data
data.loc[:,['verbatim','note','topic','topic_word','sentiment']].to_csv('data/data_review.csv')

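KeyError: "['topic', 'topic_word', 'sentiment'] not in index"

Note: this error follows from the failure of the labeling cell: the `topic`, `topic_word`, and `sentiment` columns were never created, so they cannot be selected here.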