## Importing the data

After trying to identify the companies mentioned in the original texts, we ran into numerous problems. As a result, and with the agreement of our Stat_app supervisor, we chose to insert the names of certain companies into the original articles. This notebook applies that approach.

In [1]:
import pandas as pd

In [2]:
# Load the data from the pickle file
data = pd.read_pickle('data.pkl')
data.head(5)

Unnamed: 0,Article,Date,Auteur,Nombre de mots,Journal,Titre,ID
0,\nMetropolitan Desk; SECTMB\nCan an Ambitious ...,31 December 2023,Nick Tabor,529,New York Times,Copyright 2023 The New York Times Company. Al...,NYTF000020240104ejcv0000d
1,\n\nMagazine Desk; SECTMM\nWhen Jim Brown and ...,31 December 2023,Wesley Morris,422,New York Times,"When Jim Brown and Raquel Welch, Two Sexy Star...",NYTF000020231231ejcv0006h
2,\n\nMagazine Desk; SECTMK\nTalking During Movi...,31 December 2023,,179,New York Times,Talking During Movies: Totally Evil or Part of...,NYTF000020231231ejcv00064
3,\n\nMagazine Desk; SECTMK\nLet Kids Vote!\n\n4...,31 December 2023,,454,New York Times,Let Kids Vote!,NYTF000020231231ejcv00063
4,\n\nMagazine Desk; SECTMK\nAre We Doomed to Di...,31 December 2023,Christina Caron,428,New York Times,Are We Doomed to Disagree?,NYTF000020231231ejcv0005z


## Cleaning the articles

Now that we have the information about each text (date, author, word count, etc.), we can keep only the body of the article:

In [3]:
data.insert(1, 'Copy_Article', data['Article'])
data.head(5)

Unnamed: 0,Article,Copy_Article,Date,Auteur,Nombre de mots,Journal,Titre,ID
0,\nMetropolitan Desk; SECTMB\nCan an Ambitious ...,\nMetropolitan Desk; SECTMB\nCan an Ambitious ...,31 December 2023,Nick Tabor,529,New York Times,Copyright 2023 The New York Times Company. Al...,NYTF000020240104ejcv0000d
1,\n\nMagazine Desk; SECTMM\nWhen Jim Brown and ...,\n\nMagazine Desk; SECTMM\nWhen Jim Brown and ...,31 December 2023,Wesley Morris,422,New York Times,"When Jim Brown and Raquel Welch, Two Sexy Star...",NYTF000020231231ejcv0006h
2,\n\nMagazine Desk; SECTMK\nTalking During Movi...,\n\nMagazine Desk; SECTMK\nTalking During Movi...,31 December 2023,,179,New York Times,Talking During Movies: Totally Evil or Part of...,NYTF000020231231ejcv00064
3,\n\nMagazine Desk; SECTMK\nLet Kids Vote!\n\n4...,\n\nMagazine Desk; SECTMK\nLet Kids Vote!\n\n4...,31 December 2023,,454,New York Times,Let Kids Vote!,NYTF000020231231ejcv00063
4,\n\nMagazine Desk; SECTMK\nAre We Doomed to Di...,\n\nMagazine Desk; SECTMK\nAre We Doomed to Di...,31 December 2023,Christina Caron,428,New York Times,Are We Doomed to Disagree?,NYTF000020231231ejcv0005z


First, we remove the article ID, which is located at the end of the text.

In [4]:
import re

# Remove everything from the first match of any of the given patterns onwards
def supprimer_texte_apres_motif(article, motifs):
    motif = "|".join(motifs)  # Combine the patterns into a single alternation
    match = re.search(motif, article)
    if match:
        return article[:match.start()]
    else:
        return article

# Apply supprimer_texte_apres_motif to the 'Copy_Article' column with a list of patterns
# (raw strings keep \d from being read as an invalid string escape)
data['Copy_Article'] = data['Copy_Article'].apply(lambda x: supprimer_texte_apres_motif(x, [r"Document J\d+", r"Document NYTF\d+"]))

Check:

In [5]:
for i in range(1,5):
    print("Last characters of", f"text_{i}", "before removal\n\n", data['Article'][i][-50:-1])
    print("Last characters of", f"text_{i}", "after removal\n\n", data['Copy_Article'][i][-50:-1])
    print("-----------------------------------------------------------------------------")

Last characters of text_1 before removal

 M21, MM22. 

Document NYTF000020231231ejcv0006h


Last characters of text_1 after removal

 cle appeared in print on page MM20, MM21, MM22. 

-----------------------------------------------------------------------------
Last characters of text_2 before removal

 K4, MK5. 

Document NYTF000020231231ejcv00064
 


Last characters of text_2 after removal

 his article appeared in print on page MK4, MK5. 

-----------------------------------------------------------------------------
Last characters of text_3 before removal

 age MK3. 

Document NYTF000020231231ejcv00063
 


Last characters of text_3 after removal

 o.

This article appeared in print on page MK3. 

-----------------------------------------------------------------------------
Last characters of text_4 before removal

 ge MK11. 

Document NYTF000020231231ejcv0005z
 


Last characters of text_4 after removal

 Š

This article 
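
As a complement to the row-by-row function above, the same footer removal can be sketched as a single vectorized call. This is only an illustration on made-up strings: the `FOOTER` name, the toy articles and the `J...` identifier are assumptions, not values from the dataset.

```python
import re

import pandas as pd

# Everything from "Document J<digits>" or "Document NYTF<digits>" to the end of the text
FOOTER = r"Document (?:J|NYTF)\d+.*"

# Toy articles (invented for the example)
articles = pd.Series([
    "Body of the first article.\n\nDocument NYTF000020231231ejcv0006h\n",
    "Body of the second article.\n\nDocument J000000020240101abcd\n",
    "No footer here.",
])

# re.DOTALL lets ".*" run across line breaks, so the whole tail is removed
cleaned = articles.str.replace(FOOTER, "", regex=True, flags=re.DOTALL)
print(cleaned.tolist())
```

Texts without a footer are returned unchanged, which matches the behaviour of `supprimer_texte_apres_motif` when no pattern is found.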

We now remove everything up to and including "All Rights Reserved.", which corresponds to the metadata portion of the text (author, etc.).

In [6]:
# Remove everything up to and including the first match of the given pattern
def supprimer_texte_avant_motif(article, motif):
    match = re.search(motif, article)
    if match:
        return article[match.end():]
    else:
        return article

# Apply supprimer_texte_avant_motif to the 'Copy_Article' column
# (re.escape makes the final dot match a literal "." instead of any character)
data['Copy_Article'] = data['Copy_Article'].apply(lambda x: supprimer_texte_avant_motif(x, re.escape("All Rights Reserved.")))

Check:

In [7]:
for i in range(1,3):
    print("First characters of", f"text_{i}", "before removal\n\n", data['Article'][i][0:100],"\n\n\n")
    print("First characters of", f"text_{i}", "after removal\n\n", data['Copy_Article'][i][0:100])
    print("-----------------------------------------------------------------------------")

First characters of text_1 before removal

 

Magazine Desk; SECTMM
When Jim Brown and Raquel Welch, Two Sexy Stars, Crossed Paths

By Wesley Mo 



First characters of text_1 after removal

  

In their one movie together, their chemistry was radical.

Jim Brown & Raquel Welch B. 1936 and 1
-----------------------------------------------------------------------------
First characters of text_2 before removal

 

Magazine Desk; SECTMK
Talking During Movies: Totally Evil or Part of the Fun?

179 words
31 Decemb 



First characters of text_2 after removal

  

debatethis

Talking during movies: Totally evil or part of the fun?

show of hands

The biggest p
-----------------------------------------------------------------------------
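
Because the marker is a fixed string rather than a real pattern, one possible alternative is `str.partition`, which treats it literally. The sketch below is a hypothetical variant (`strip_header` and the sample text are made up for illustration), not the function used in this notebook.

```python
def strip_header(article, marker="All Rights Reserved."):
    """Return the text that follows the first occurrence of marker.

    The marker is matched literally; if it is absent, the article is
    returned unchanged.
    """
    head, sep, tail = article.partition(marker)
    return tail if sep else article


# Toy article (invented for the example)
sample = (
    "Magazine Desk; SECTMK\n179 words\n"
    "Copyright 2023 The New York Times Company. All Rights Reserved.\n\n"
    "Body of the article."
)
print(repr(strip_header(sample)))
```

A text ending in "All Rights Reserved!" is left untouched here, whereas an unescaped regular expression would treat the final dot as a wildcard and match it.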


In [8]:
data.rename(columns={'Copy_Article': 'Coeur_Article'}, inplace=True)
data.head(5)

Unnamed: 0,Article,Coeur_Article,Date,Auteur,Nombre de mots,Journal,Titre,ID
0,\nMetropolitan Desk; SECTMB\nCan an Ambitious ...,"\n\nStony Brook University, one of two state ...",31 December 2023,Nick Tabor,529,New York Times,Copyright 2023 The New York Times Company. Al...,NYTF000020240104ejcv0000d
1,\n\nMagazine Desk; SECTMM\nWhen Jim Brown and ...,"\n\nIn their one movie together, their chemis...",31 December 2023,Wesley Morris,422,New York Times,"When Jim Brown and Raquel Welch, Two Sexy Star...",NYTF000020231231ejcv0006h
2,\n\nMagazine Desk; SECTMK\nTalking During Movi...,\n\ndebatethis\n\nTalking during movies: Tota...,31 December 2023,,179,New York Times,Talking During Movies: Totally Evil or Part of...,NYTF000020231231ejcv00064
3,\n\nMagazine Desk; SECTMK\nLet Kids Vote!\n\n4...,\n\nLET KIDS\n\nVOTE!\n\nby Katherine Cusuman...,31 December 2023,,454,New York Times,Let Kids Vote!,NYTF000020231231ejcv00063
4,\n\nMagazine Desk; SECTMK\nAre We Doomed to Di...,\n\nare we DOOMED TO DISAGREE?\n\nwhy it's so...,31 December 2023,Christina Caron,428,New York Times,Are We Doomed to Disagree?,NYTF000020231231ejcv0005z


## Modifying the articles

### Function for inserting strings into an article

In [9]:
import random

def insertion_phrase_dans_article(phrase, article):
    # Find the positions of all periods in the article
    emplacements_points = [i for i, char in enumerate(article) if char == '.']

    # Check whether the article contains any period
    if emplacements_points:
        # Pick one of the period positions at random
        indice_insertion = random.choice(emplacements_points)
        # Insert the sentence right after the chosen period
        article = article[:indice_insertion+1] + " " + phrase + article[indice_insertion+1:]
    else:
        # If there is no period, insert the sentence at the beginning of the article
        article = phrase + " " + article
    return article

Check:

In [10]:
phrase_a_inserer = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA."
article_insertion = data['Coeur_Article'][3]
article_insertion = insertion_phrase_dans_article(phrase_a_inserer, article_insertion)

# Display the article with the inserted sentence
print(article_insertion)

 

LET KIDS

VOTE!

by Katherine Cusumano

Julia Rottenberg, 17, spent the fall of last year knocking on doors. On Election Day 2022, people in Culver City, Calif., her hometown, would have a big decision to make: Should the voting age for local elections change from 18 to 16? Julia wanted them to vote yes. ''I think a vote is one of the most direct ways that you can express an opinion and actually have some change happen,'' says Julia, who is part of an organization called Vote16 Culver City.

What's the argument for giving kids the vote? Well, as you may have noticed, there are a lot of decisions being made (or not) about things that affect kids directly, like climate change or gun violence or school resources. Yes, young people are already leading political movements around these issues. But without the vote, they can't elect politicians who represent their views and make real change. ''There's a whole lot of talk about prioritizing young people and politicians saying, 'We care abou
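
The function above treats every period as a possible insertion point, including those inside abbreviations ("Calif.", "F.B.I.") or prices ("$18.99"). A minimal sketch of a stricter variant is given below, assuming that a sentence ends at a period followed by whitespace and a capital letter, or at the end of the text. The names `insert_after_sentence` and `SENTENCE_END` are hypothetical, and the heuristic will still misfire on cases such as "Calif., her hometown".

```python
import random
import re

# A period counts as a sentence end if it is followed by whitespace and a
# capital letter, or if nothing but whitespace follows it (assumed heuristic)
SENTENCE_END = re.compile(r"\.(?=\s+[A-Z]|\s*$)")


def insert_after_sentence(phrase, article, rng=None):
    """Insert phrase after a randomly chosen sentence-ending period."""
    rng = rng or random.Random()
    positions = [m.end() for m in SENTENCE_END.finditer(article)]
    if not positions:
        # No sentence end found: fall back to the start of the article
        return phrase + " " + article
    pos = rng.choice(positions)
    return article[:pos] + " " + phrase + article[pos:]


text = "It cost $18.99 at the F.B.I. shop. Then we left."
print(insert_after_sentence("ACME Corp.", text, random.Random(0)))
```

Passing a seeded `random.Random` instance makes the chosen position reproducible from one run to the next.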

### Method 1

This method uses the English environmental dictionary, from which we build a series of fill-in-the-blank sentences. Each sentence is filled at random with a company name and terms from the dictionary. The resulting sentences can carry a positive, negative, or neutral sentiment from an environmental point of view. For each company, we generate a random number of sentences about it. All of these sentences (which concern a single company) are then inserted into one and the same article. Thus: 1 article = 1 identified company.

#### Creating the sentences to insert

In [11]:
Dico_env_en = {

    # Positive terms (score 1); duplicate keys removed
    "clean": 1,
    "ecological": 1,
    "sustainable": 1,
    "green": 1,
    "energy-efficient": 1,
    "renewable": 1,
    "responsible": 1,
    "conservation": 1,
    "biodiversity": 1,
    "healthy": 1,
    "organic": 1,
    "eco-friendly": 1,
    "environmentally friendly": 1,
    "efficient": 1,
    "innovative": 1,
    "ethical": 1,
    "fair": 1,
    "efficiency": 1,
    "social responsibility": 1,
    "solidarity": 1,
    "conscious spreading": 1,
    "clean energy": 1,
    "renewable energy": 1,
    "recycling": 1,
    "energy efficiency": 1,
    "circular economy": 1,
    "solar energy": 1,
    "wind energy": 1,
    "regeneration": 1,
    "preservation": 1,
    "restoration": 1,
    "rehabilitation": 1,
    "recovery": 1,
    "restorer": 1,
    "regenerator": 1,
    "revitalization": 1,
    "positive": 1,
    "beneficial": 1,
    "valorization": 1,
    "fulfillment": 1,
    "continuous improvement": 1,
    "prosperity": 1,
    "harmony": 1,
    "integrity": 1,
    "responsible consumption": 1,
    "eco-responsible": 1,
    "eco-conscious": 1,
    "sustainability": 1,
    "recoverable": 1,
    "green energy": 1,
    "greenhouse effect": 1,
    "eco-efficient": 1,
    "eco-innovation": 1,
    "well-being": 1,
    "eco-design": 1,
    "agroecology": 1,
    "permaculture": 1,
    "eco-citizen": 1,
    "carbon neutral": 1,
    "zero waste": 1,
    "eco-label": 1,
    "sustainable mobility": 1,
    "eco-tourism": 1,
    "eco-habitat": 1,
    "conscious consumption": 1,

    # Negative terms (score -1); duplicate keys removed
    "pollution": -1,
    "waste": -1,
    "deforestation": -1,
    "greenhouse gas emissions": -1,
    "contamination": -1,
    "destructive": -1,
    "irresponsible": -1,
    "wasteful": -1,
    "harmful": -1,
    "toxic": -1,
    "deterioration": -1,
    "degradation": -1,
    "damaging": -1,
    "perilous": -1,
    "worrisome": -1,
    "catastrophic": -1,
    "catastrophe": -1,
    "dangerous": -1,
    "threat": -1,
    "risk": -1,
    "hazardous": -1,
    "inappropriate": -1,
    "inadequate": -1,
    "harm": -1,
    "damage": -1,
    "pollutant": -1,
    "pollute": -1,
    "deteriorate": -1,
    "disruption": -1,
    "disrespectful": -1,
    "malevolent": -1,
    "aggressive": -1,
    "ravager": -1,
    "spoil": -1,
    "disturb": -1,
    "irreparable": -1,
    "toxicity": -1,
    "unacceptable": -1,
    "ecological damage": -1,
    "illegal logging": -1,
    "overconsumption": -1,
    "resource plundering": -1,
    "environmental degradation": -1,
    "destroyed natural habitat": -1,
    "excessive exploitation": -1,
    "overexploitation": -1,
    "climate change": -1,
    "environmental denial": -1,
}

negation_list = ["not", "no", "never", "none", "nil", "nothing", "nobody", "negative", "without", "more", "less"]

negation_cancellation_list = ["responsible", "originally", "source"]

In [12]:
import pandas as pd
import random

# Load the CSV file containing the list of companies
df = pd.read_csv('Firms.csv')

# Sentence templates
def generate_positive_structures(company, positive_terms):
    return [
        f"The company {company} is committed to a {random.choice(positive_terms)[0]} approach to promote {random.choice(positive_terms)[0]}.",
        f"Thanks to its {random.choice(positive_terms)[0]} initiative, {company} strengthens its commitment to {random.choice(positive_terms)[0]}.",
        f"{company} implements {random.choice(positive_terms)[0]} practices to support {random.choice(positive_terms)[0]}.",
        f"As a {random.choice(positive_terms)[0]} company, {company} takes measures to encourage {random.choice(positive_terms)[0]}.",
        f"{company} communicates about its {random.choice(positive_terms)[0]} commitment and its positive contribution to {random.choice(positive_terms)[0]}.",
        f"{company} is recognized for its {random.choice(positive_terms)[0]} approach and its positive impact on {random.choice(positive_terms)[0]}.",
        f"Through its {random.choice(positive_terms)[0]} actions, {company} aims to improve {random.choice(positive_terms)[0]}.",
        f"{company} adopts a {random.choice(positive_terms)[0]} strategy to promote {random.choice(positive_terms)[0]}.",
        f"The {random.choice(positive_terms)[0]} approach of {company} reflects its commitment to {random.choice(positive_terms)[0]}.",
        f"{company} values its {random.choice(positive_terms)[0]} commitment and its respect for {random.choice(positive_terms)[0]}."
    ]

def generate_negative_structures(company, negative_terms):
    return [
        f"The company {company} is criticized for its lack of commitment to {random.choice(negative_terms)[0]}.",
        f"{company} is singled out for its {random.choice(negative_terms)[0]}.",
        f"{company}'s {random.choice(negative_terms)[0]} practices have raised concerns among environmentalists.",
        f"{company} faces scrutiny for its {random.choice(negative_terms)[0]} approach.",
        f"Some question {company}'s commitment due to its {random.choice(negative_terms)[0]}.",
        f"{company} is under fire for its {random.choice(negative_terms)[0]} strategy.",
        f"Concerns are raised about {company}'s {random.choice(negative_terms)[0]} practices.",
        f"{company} is criticized for its failure to address {random.choice(negative_terms)[0]}.",
        f"{company}'s {random.choice(negative_terms)[0]} initiative is viewed with skepticism.",
        f"{company} is blamed for its {random.choice(negative_terms)[0]} impact."
    ]

def generate_mixed_structures(company, positive_terms, negative_terms):
    return [
        f"{company} is exploring {random.choice(positive_terms)[0]} initiatives to address {random.choice(negative_terms)[0]}.",
        f"The company {company} is researching {random.choice(positive_terms)[0]} solutions for {random.choice(negative_terms)[0]}.",
        f"{company} is developing {random.choice(positive_terms)[0]} practices while managing {random.choice(negative_terms)[0]}.",
        f"The approach of {company} involves {random.choice(positive_terms)[0]} methods to mitigate {random.choice(negative_terms)[0]}.",
        f"{company}'s {random.choice(positive_terms)[0]} efforts are focused on {random.choice(negative_terms)[0]}.",
        f"{company} is committed to {random.choice(positive_terms)[0]} actions and addressing {random.choice(negative_terms)[0]}.",
        f"{company} integrates {random.choice(positive_terms)[0]} strategies with {random.choice(negative_terms)[0]} management.",
        f"The company {company} emphasizes {random.choice(positive_terms)[0]} practices alongside {random.choice(negative_terms)[0]}.",
        f"{company} implements {random.choice(positive_terms)[0]} measures while considering {random.choice(negative_terms)[0]}.",
        f"{company} is dedicated to {random.choice(positive_terms)[0]} approaches and {random.choice(negative_terms)[0]} initiatives."
    ]


# Adjust the score of a term that is itself a negation or negation-cancelling term
def handle_negation(term, score):
    if term in negation_list:
        return -score
    elif term in negation_cancellation_list:
        return 0
    else:
        return score

# Generate, for each company, a random number (between a and b) of sentences about its environmental communication
def generate_environmental_communication(company_list, env_dict, a, b):
    positive_terms = [(term, score) for term, score in env_dict.items() if score == 1]
    negative_terms = [(term, score) for term, score in env_dict.items() if score == -1]
    
    company_sentences = {}  # Dictionary grouping the sentences by company
    
    for company in company_list:
        num_sentences = random.randint(a, b)
        sentences = []
        
        for _ in range(num_sentences):
            if random.choice([True, False]):
                structures = generate_positive_structures(company, positive_terms)
            else:
                if negative_terms:
                    structures = generate_negative_structures(company, negative_terms) + generate_mixed_structures(company, positive_terms, negative_terms)
                else:
                    structures = generate_positive_structures(company, positive_terms) + generate_mixed_structures(company, positive_terms, negative_terms)

            sentence = random.choice(structures)
            
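            # Handle negation terms in the sentence.
            # Note: with the current dictionary and lists this loop never replaces a word,
            # since no dictionary term is in negation_list and no term has a score of 0.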
            sentence_words = sentence.split()
            for i, word in enumerate(sentence_words):
                if word.lower() in [term.lower() for term, _ in positive_terms + negative_terms]:
                    original_score = next((score for term, score in positive_terms + negative_terms if term.lower() == word.lower()), None)
                    if original_score:
                        new_score = handle_negation(word.lower(), original_score)
                        if new_score != original_score:
                            replacement = next((term for term, score in env_dict.items() if score == new_score), None)
                            if replacement:
                                sentence_words[i] = replacement

            # Rebuild the (possibly modified) sentence
            modified_sentence = ' '.join(sentence_words)
            sentences.append(modified_sentence)
        
        company_sentences[company] = sentences
    
    return company_sentences

# Select 100 companies at random
random_companies = df['Company'].sample(n=100, random_state=42).tolist()

# Use the function to generate the sentences (between a and b sentences per company)
a = 3
b = 5
company_sentences = generate_environmental_communication(random_companies, Dico_env_en, a, b)

# Display the generated sentences for each company
for company, sentences in company_sentences.items():
    print(f"{company}:")
    for idx, sentence in enumerate(sentences, 1):
        print(f"  {idx}. {sentence}")

Kirloskar Brothers Ltd:
  1. Concerns are raised about Kirloskar Brothers Ltd's resource plundering practices.
  2. The company Kirloskar Brothers Ltd is committed to a solidarity approach to promote carbon neutral.
  3. Kirloskar Brothers Ltd is committed to eco-conscious actions and addressing unacceptable.
Seer Inc:
  1. Seer Inc communicates about its continuous improvement commitment and its positive contribution to continuous improvement.
  2. Seer Inc implements renewable practices to support permaculture.
  3. The approach of Seer Inc involves positive methods to mitigate malevolent.
Samsung Life Insurance Co Ltd:
  1. Samsung Life Insurance Co Ltd is recognized for its greenhouse effect approach and its positive impact on eco-innovation.
  2. Samsung Life Insurance Co Ltd is recognized for its integrity approach and its positive impact on ethical.
  3. Samsung Life Insurance Co Ltd adopts a agroecology strategy to promote eco-design.
  4. Through its circular economy actions, 
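
To make the template idea easier to follow, here is a reduced sketch that keeps the sentiment label next to each generated sentence and uses a seeded generator so that results can be reproduced. The three templates are taken from the cell above; the short term lists, the `TEMPLATES` layout and the function name `generate_labelled_sentences` are assumptions made for the example.

```python
import random

# Small subsets of the dictionary, for illustration only
POSITIVE = ["sustainable", "renewable", "eco-friendly"]
NEGATIVE = ["pollution", "deforestation", "overconsumption"]

# One template per sentiment, copied from the structures used in the notebook
TEMPLATES = {
    "positive": "{company} adopts a {pos} strategy to promote {pos2}.",
    "negative": "{company} is criticized for its failure to address {neg}.",
    "mixed": "{company} is exploring {pos} initiatives to address {neg}.",
}


def generate_labelled_sentences(company, n, rng):
    """Return n (label, sentence) pairs for one company."""
    pairs = []
    for _ in range(n):
        label = rng.choice(sorted(TEMPLATES))
        sentence = TEMPLATES[label].format(
            company=company,
            pos=rng.choice(POSITIVE),
            pos2=rng.choice(POSITIVE),
            neg=rng.choice(NEGATIVE),
        )
        pairs.append((label, sentence))
    return pairs


for label, sentence in generate_labelled_sentences("Seer Inc", 3, random.Random(42)):
    print(f"[{label}] {sentence}")
```

Keeping the label would make it possible to compare a later sentiment score with the sentiment the sentence was built to carry.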

We now want to insert these sentences into the text. Each insertion position is chosen at random, and the inserted sentence always follows a period. This is how the name of a company ends up in the article.

In [13]:
# Create a new column 'Entreprise_Insérée_1' in the DataFrame
data['Entreprise_Insérée_1'] = None

# Fill the column with the company names (one company per article, in row order)
for idx, (company, _) in enumerate(company_sentences.items()):
    data.at[idx, 'Entreprise_Insérée_1'] = company

# Display the first rows of the DataFrame as a check
data.head()

Unnamed: 0,Article,Coeur_Article,Date,Auteur,Nombre de mots,Journal,Titre,ID,Entreprise_Insérée_1
0,\nMetropolitan Desk; SECTMB\nCan an Ambitious ...,"\n\nStony Brook University, one of two state ...",31 December 2023,Nick Tabor,529,New York Times,Copyright 2023 The New York Times Company. Al...,NYTF000020240104ejcv0000d,Kirloskar Brothers Ltd
1,\n\nMagazine Desk; SECTMM\nWhen Jim Brown and ...,"\n\nIn their one movie together, their chemis...",31 December 2023,Wesley Morris,422,New York Times,"When Jim Brown and Raquel Welch, Two Sexy Star...",NYTF000020231231ejcv0006h,Seer Inc
2,\n\nMagazine Desk; SECTMK\nTalking During Movi...,\n\ndebatethis\n\nTalking during movies: Tota...,31 December 2023,,179,New York Times,Talking During Movies: Totally Evil or Part of...,NYTF000020231231ejcv00064,Samsung Life Insurance Co Ltd
3,\n\nMagazine Desk; SECTMK\nLet Kids Vote!\n\n4...,\n\nLET KIDS\n\nVOTE!\n\nby Katherine Cusuman...,31 December 2023,,454,New York Times,Let Kids Vote!,NYTF000020231231ejcv00063,Kontoor Brands Inc
4,\n\nMagazine Desk; SECTMK\nAre We Doomed to Di...,\n\nare we DOOMED TO DISAGREE?\n\nwhy it's so...,31 December 2023,Christina Caron,428,New York Times,Are We Doomed to Disagree?,NYTF000020231231ejcv0005z,Tauron Polska Energia SA


In [14]:
# Create a new column for the articles with inserted sentences
data['Coeur_Article_Inséré_1'] = ""

# Insert the generated sentences into the articles
for company, sentences in company_sentences.items():
    article = data[data['Entreprise_Insérée_1'] == company]['Coeur_Article'].iloc[0]  # Retrieve the article associated with the company
    for sentence in sentences:
        article = insertion_phrase_dans_article(sentence, article)
    data.loc[data['Entreprise_Insérée_1'] == company, 'Coeur_Article_Inséré_1'] = article  # Store the modified article in the new column

data.head()


Unnamed: 0,Article,Coeur_Article,Date,Auteur,Nombre de mots,Journal,Titre,ID,Entreprise_Insérée_1,Coeur_Article_Inséré_1
0,\nMetropolitan Desk; SECTMB\nCan an Ambitious ...,"\n\nStony Brook University, one of two state ...",31 December 2023,Nick Tabor,529,New York Times,Copyright 2023 The New York Times Company. Al...,NYTF000020240104ejcv0000d,Kirloskar Brothers Ltd,"\n\nStony Brook University, one of two state ..."
1,\n\nMagazine Desk; SECTMM\nWhen Jim Brown and ...,"\n\nIn their one movie together, their chemis...",31 December 2023,Wesley Morris,422,New York Times,"When Jim Brown and Raquel Welch, Two Sexy Star...",NYTF000020231231ejcv0006h,Seer Inc,"\n\nIn their one movie together, their chemis..."
2,\n\nMagazine Desk; SECTMK\nTalking During Movi...,\n\ndebatethis\n\nTalking during movies: Tota...,31 December 2023,,179,New York Times,Talking During Movies: Totally Evil or Part of...,NYTF000020231231ejcv00064,Samsung Life Insurance Co Ltd,\n\ndebatethis\n\nTalking during movies: Tota...
3,\n\nMagazine Desk; SECTMK\nLet Kids Vote!\n\n4...,\n\nLET KIDS\n\nVOTE!\n\nby Katherine Cusuman...,31 December 2023,,454,New York Times,Let Kids Vote!,NYTF000020231231ejcv00063,Kontoor Brands Inc,\n\nLET KIDS\n\nVOTE!\n\nby Katherine Cusuman...
4,\n\nMagazine Desk; SECTMK\nAre We Doomed to Di...,\n\nare we DOOMED TO DISAGREE?\n\nwhy it's so...,31 December 2023,Christina Caron,428,New York Times,Are We Doomed to Disagree?,NYTF000020231231ejcv0005z,Tauron Polska Energia SA,\n\nare we DOOMED TO DISAGREE?\n\nwhy it's so...
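
Since `head()` only shows the truncated start of each text, a quick way to confirm the insertion is to test whether every assigned company name actually appears in its modified article. The sketch below runs on a made-up three-row table; the helper name `rows_missing_company` is hypothetical.

```python
import pandas as pd

# Toy table reusing the notebook's column names (content invented for the example)
toy = pd.DataFrame({
    "Entreprise_Insérée_1": ["Seer Inc", "Mondi plc", None],
    "Coeur_Article_Inséré_1": [
        "Intro. Seer Inc adopts a green strategy to promote recycling. End.",
        "Nothing was inserted here.",
        "",
    ],
})


def rows_missing_company(df, company_col, text_col):
    """Return the rows that have a company assigned but whose text lacks it."""
    assigned = df.dropna(subset=[company_col])
    present = pd.Series(
        [company in text for company, text in zip(assigned[company_col], assigned[text_col])],
        index=assigned.index,
        dtype=bool,
    )
    return assigned[~present]


missing = rows_missing_company(toy, "Entreprise_Insérée_1", "Coeur_Article_Inséré_1")
print(missing.index.tolist())
```

An empty result on the real table would indicate that every assigned company was found in its article.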


### Method 2

Here we simply insert company names at random into an article. To do so, we write a function that draws the number of companies to insert into a given article, and then insert all of those companies into the articles. Here: 1 article = several companies.

The number of companies to insert into the text is chosen at random, within bounds that we set ourselves.

In [15]:
import pandas as pd

# List of company names
noms_entreprises = pd.read_csv('Firms.csv')['Company'].tolist()

# Randomly select between a and b distinct companies
def choisir_entreprises(a, b):
    # Make sure b does not exceed the number of available company names
    b = min(b, len(noms_entreprises))
    # Draw a random number of companies between a and b (inclusive)
    nb_entreprises = random.randint(a, b)
    return random.sample(noms_entreprises, nb_entreprises)

choisir_entreprises(1,2)

['Magnis Energy Technologies Ltd']
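
The function above clamps the upper bound but does not check that the lower bound is still valid afterwards. A minimal sketch with explicit validation and an optional seeded generator could look as follows; `pick_companies` and the short `names` list are assumptions used only for illustration.

```python
import random


def pick_companies(names, a, b, rng=None):
    """Draw between a and b distinct names (inclusive bounds)."""
    rng = rng or random.Random()
    b = min(b, len(names))  # cannot draw more names than are available
    if not 0 <= a <= b:
        raise ValueError(f"invalid bounds: a={a}, b={b}")
    return rng.sample(names, rng.randint(a, b))


# Company names taken from the outputs shown in this notebook
names = ["Seer Inc", "Mondi plc", "Bruker Corp"]
print(pick_companies(names, 1, 2, random.Random(1)))
```

`random.sample` draws without replacement, so the same company cannot be picked twice for one article.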

We then insert all of these companies into a text.

In [16]:
entreprises_a_inserer = choisir_entreprises(1,2)
article_insertion = data['Coeur_Article'][8]

for entreprise in entreprises_a_inserer:
    article_insertion = insertion_phrase_dans_article(entreprise, article_insertion)
    
print(entreprises_a_inserer, article_insertion)

['Shunfeng International Clean Energy Ltd', 'China Resources Sanjiu Medical & Pharmaceutical Co Ltd']  

A Few Days Full of Trouble: Revelations on the Journey to Justice for My Cousin and Best Friend, Emmett Till, by the Rev. Wheeler Parker Jr. and Christopher Benson. (One World, 432 pp., $18.99.) ''I have known the truth,'' Parker writes again and again in this moving memoir, recounting his family's devastation at the 1955 lynching of his cousin, his life as a minister resisting racism and the aftermath of the F.B.I.'s 2018 "reawakening'' of Till's murder case.

Age of Vice, by Deepti Kapoor. (Riverhead, 560 pp., $20.) Kapoor's thriller ushers readers through the underbelly of contemporary New Delhi. Shunfeng International Clean Energy Ltd It follows Ajay, a servant of a powerful crime family, charged with protecting their eldest son. But as a journalist narrows in on the family's misdeeds amid a deadly incident, Ajay must increasingly shield himself.

Roses, in the Mouth of a Lion, 

#### Automation for all articles

In [17]:
nb_min_entreprise_par_article = 1
nb_max_entreprise_par_article = 2

entreprises_par_article = []
articles_modifies = []

# Iterate over the articles in row order. Collecting the results in lists avoids
# list.index(article) (which returns the wrong row when two articles are identical)
# and chained assignment (the source of the warnings that were being silenced).
for article in data['Coeur_Article']:
    entreprises_a_inserer = choisir_entreprises(nb_min_entreprise_par_article, nb_max_entreprise_par_article)

    for entreprise in entreprises_a_inserer:
        article = insertion_phrase_dans_article(entreprise, article)

    entreprises_par_article.append(entreprises_a_inserer)
    articles_modifies.append(article)

# Place the two new columns right after 'Article'
data.insert(1, 'Entreprises_inserées_2', pd.Series(entreprises_par_article, index=data.index))
data.insert(2, 'Articles_avec_entreprises_2', pd.Series(articles_modifies, index=data.index))

data.head(5)

Unnamed: 0,Article,Entreprises_inserées_2,Articles_avec_entreprises_2,Coeur_Article,Date,Auteur,Nombre de mots,Journal,Titre,ID,Entreprise_Insérée_1,Coeur_Article_Inséré_1
0,\nMetropolitan Desk; SECTMB\nCan an Ambitious ...,[BVZ Holding AG],"\n\nStony Brook University, one of two state ...","\n\nStony Brook University, one of two state ...",31 December 2023,Nick Tabor,529,New York Times,Copyright 2023 The New York Times Company. Al...,NYTF000020240104ejcv0000d,Kirloskar Brothers Ltd,"\n\nStony Brook University, one of two state ..."
1,\n\nMagazine Desk; SECTMM\nWhen Jim Brown and ...,[Samsung C&T Corp],"\n\nIn their one movie together, their chemis...","\n\nIn their one movie together, their chemis...",31 December 2023,Wesley Morris,422,New York Times,"When Jim Brown and Raquel Welch, Two Sexy Star...",NYTF000020231231ejcv0006h,Seer Inc,"\n\nIn their one movie together, their chemis..."
2,\n\nMagazine Desk; SECTMK\nTalking During Movi...,[Bruker Corp],\n\ndebatethis\n\nTalking during movies: Tota...,\n\ndebatethis\n\nTalking during movies: Tota...,31 December 2023,,179,New York Times,Talking During Movies: Totally Evil or Part of...,NYTF000020231231ejcv00064,Samsung Life Insurance Co Ltd,\n\ndebatethis\n\nTalking during movies: Tota...
3,\n\nMagazine Desk; SECTMK\nLet Kids Vote!\n\n4...,"[New Hope Dairy Co Ltd, 361 Degrees Internatio...",\n\nLET KIDS\n\nVOTE!\n\nby Katherine Cusuman...,\n\nLET KIDS\n\nVOTE!\n\nby Katherine Cusuman...,31 December 2023,,454,New York Times,Let Kids Vote!,NYTF000020231231ejcv00063,Kontoor Brands Inc,\n\nLET KIDS\n\nVOTE!\n\nby Katherine Cusuman...
4,\n\nMagazine Desk; SECTMK\nAre We Doomed to Di...,[Mondi plc],\n\nare we DOOMED TO DISAGREE?\n\nwhy it's so...,\n\nare we DOOMED TO DISAGREE?\n\nwhy it's so...,31 December 2023,Christina Caron,428,New York Times,Are We Doomed to Disagree?,NYTF000020231231ejcv0005z,Tauron Polska Energia SA,\n\nare we DOOMED TO DISAGREE?\n\nwhy it's so...


In [18]:
data_avec_entreprises = data
data_avec_entreprises.head(5)

Unnamed: 0,Article,Entreprises_inserées_2,Articles_avec_entreprises_2,Coeur_Article,Date,Auteur,Nombre de mots,Journal,Titre,ID,Entreprise_Insérée_1,Coeur_Article_Inséré_1
0,\nMetropolitan Desk; SECTMB\nCan an Ambitious ...,[BVZ Holding AG],"\n\nStony Brook University, one of two state ...","\n\nStony Brook University, one of two state ...",31 December 2023,Nick Tabor,529,New York Times,Copyright 2023 The New York Times Company. Al...,NYTF000020240104ejcv0000d,Kirloskar Brothers Ltd,"\n\nStony Brook University, one of two state ..."
1,\n\nMagazine Desk; SECTMM\nWhen Jim Brown and ...,[Samsung C&T Corp],"\n\nIn their one movie together, their chemis...","\n\nIn their one movie together, their chemis...",31 December 2023,Wesley Morris,422,New York Times,"When Jim Brown and Raquel Welch, Two Sexy Star...",NYTF000020231231ejcv0006h,Seer Inc,"\n\nIn their one movie together, their chemis..."
2,\n\nMagazine Desk; SECTMK\nTalking During Movi...,[Bruker Corp],\n\ndebatethis\n\nTalking during movies: Tota...,\n\ndebatethis\n\nTalking during movies: Tota...,31 December 2023,,179,New York Times,Talking During Movies: Totally Evil or Part of...,NYTF000020231231ejcv00064,Samsung Life Insurance Co Ltd,\n\ndebatethis\n\nTalking during movies: Tota...
3,\n\nMagazine Desk; SECTMK\nLet Kids Vote!\n\n4...,"[New Hope Dairy Co Ltd, 361 Degrees Internatio...",\n\nLET KIDS\n\nVOTE!\n\nby Katherine Cusuman...,\n\nLET KIDS\n\nVOTE!\n\nby Katherine Cusuman...,31 December 2023,,454,New York Times,Let Kids Vote!,NYTF000020231231ejcv00063,Kontoor Brands Inc,\n\nLET KIDS\n\nVOTE!\n\nby Katherine Cusuman...
4,\n\nMagazine Desk; SECTMK\nAre We Doomed to Di...,[Mondi plc],\n\nare we DOOMED TO DISAGREE?\n\nwhy it's so...,\n\nare we DOOMED TO DISAGREE?\n\nwhy it's so...,31 December 2023,Christina Caron,428,New York Times,Are We Doomed to Disagree?,NYTF000020231231ejcv0005z,Tauron Polska Energia SA,\n\nare we DOOMED TO DISAGREE?\n\nwhy it's so...


## Exporting the new data table

In [19]:
# Save as a pickle file to preserve the data types
data_avec_entreprises.to_pickle('data_avec_entreprises.pkl')
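
The choice of pickle matters here because one column holds Python lists of company names. The sketch below, on a made-up two-row table, illustrates the difference with a CSV round trip; the toy data and the temporary file name are assumptions.

```python
import io
import os
import tempfile

import pandas as pd

# Toy table with a list-valued column, as in 'Entreprises_inserées_2'
toy = pd.DataFrame({
    "Entreprises_inserées_2": [["BVZ Holding AG"], ["New Hope Dairy Co Ltd", "Mondi plc"]],
})

# Round trip through pickle: the lists come back as lists
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")
    toy.to_pickle(path)
    from_pickle = pd.read_pickle(path)

# Round trip through CSV: the lists come back as their string representation
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

print(type(from_pickle.iloc[0, 0]).__name__, type(from_csv.iloc[0, 0]).__name__)
```

With CSV, the lists would have to be parsed back from text after loading, which the pickle format avoids.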