<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Install-/-Download-/-Import-Librairies" data-toc-modified-id="Install-/-Download-/-Import-Librairies-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Install / Download / Import Librairies</a></span></li><li><span><a href="#Part-1---Scraping-of-general-information-(id,-title,-date,-authors-and-language)." data-toc-modified-id="Part-1---Scraping-of-general-information-(id,-title,-date,-authors-and-language).-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Part 1 - Scraping of general information (id, title, date, authors and language).</a></span></li><li><span><a href="#Part-2---Scraping-of-specific-information-(keywords-and-abstracts)." data-toc-modified-id="Part-2---Scraping-of-specific-information-(keywords-and-abstracts).-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Part 2 - Scraping of specific information (keywords and abstracts).</a></span></li><li><span><a href="#Part-3---Extraction-of-information-from-abstracts-and-titles" data-toc-modified-id="Part-3---Extraction-of-information-from-abstracts-and-titles-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Part 3 - Extraction of information from abstracts and titles</a></span><ul class="toc-item"><li><span><a href="#Part-3-1)" data-toc-modified-id="Part-3-1)-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Part 3-1)</a></span><ul class="toc-item"><li><span><a href="#Pls-éléments" data-toc-modified-id="Pls-éléments-4.1.1"><span class="toc-item-num">4.1.1&nbsp;&nbsp;</span>Pls éléments</a></span></li></ul></li></ul></li></ul></div>

# Install / Download / Import Librairies

In [1]:
# Scraping librairies
import requests
from bs4 import BeautifulSoup

# Text librairy
import re

# Useful librairies
from tqdm import tqdm
import pandas as pd
import numpy as np

# Part 1 - Scraping of general information (id, title, date, authors and language). 
In this first part, we will retrieve information that can be accessed directly from the search page (without the need to "click" on the article link). 

In [2]:
# Create a session object
with requests.Session() as s:

    # Opening the "df_infos.csv" file (rather creation as it does not exist) in write mode.
    # This file will store the scraper information (which avoids re-executing the code each time).
    with open("df_infos.csv", "w") as outf:

        # Writing the header line.
        outf.write("Id;Date;Auteur;Titre;Langue\n")

        # Browsing the pages to retrieve information from 11450 articles (100 articles per page)
        for page in tqdm(range(0, 11450, 100)):

            # Research equation on title and abstract only (not on all the content)
            research_eq = "(vehicle* OR transport OR transports OR train OR trains OR tractor OR bike* \
            OR bicycle* OR boat OR boats OR ship OR ships OR *plane OR *planes OR aircraft* OR car \
            OR cars OR truck* OR lorry* OR bus OR automobile* OR motor* OR rocket*) \
            AND hydrogen AND (ecolog* OR climat* OR pollution* OR environment* OR \"renewable energy\" \
            OR emission* OR \"carbon neutrality\" OR \"global warming\" OR conservation* OR sustaina*)"

            # Data to make the request
            request_data = {"basicQuery": {
                "count": 100,  # 100 articles per page
                "searchCriteria": "title : (" + research_eq + ") abstract:(" + research_eq + ")",
                "offset": page,  # page number
                "sortByDate": False}  # we sort by relevance and not by date
            }

            # Sending the POST request with the request data.
            res = s.post("https://core.ac.uk/search/api/search",
                         json=request_data)

            # Storage of results (in .json format)
            json = res.json()

            # Separation of different information :

            # (1) Identifiants
            ids = [elem['id'] for elem in json['results']]

            # (2) Dates
            dates = [elem['datePublished'][:10]
                     if 'datePublished' in elem else '' for elem in json['results']]

            # (3) Authors
            authors = [elem['authorsString'].replace('\n', '').replace('\r', '').replace('/', '').replace(',', '/')
                       if 'authorsString' in elem else '' for elem in json['results']]

            # (4) Titles
            titles = [elem['title'].replace('\n', '').replace(';', ',').replace('\r', '')
                      if 'title' in elem else '' for elem in json['results']]

            # (5) Language
            languages = [elem['language']['name']
                         if 'language' in elem else '' for elem in json['results']]

            # For each of the 100 articles on the current page ... 
            for identifiant, date, author, title, language in zip(ids, dates, authors, titles, languages):
                # ... writing to the .csv file. 
                outf.write(identifiant + ';' + date + ';' + author +
                           ';' + title + ';' + language + '\n')

100%|██████████| 115/115 [15:14<00:00,  7.95s/it]


# Part 2 - Scraping of specific information (keywords and abstracts).
Thanks to the recovered identifiers in part 1, we can access the content of the articles. In particular, we retrieve the keywords and summaries (if present). 

In [2]:
# To avoid re-executing the code from part 1 (15 min), we read the .csv file into a dataframe.
df1 = pd.read_csv("df_infos.csv", sep=';', index_col=False, encoding='utf-8')

# Any duplicates are removed
df1 = df1.drop_duplicates()

# Format
df1.shape

(8801, 5)

Once the duplicates are removed, only 8801 articles (out of 11450) remain.

In [9]:
# Create a session object
with requests.Session() as s:

    # Opening the "df_content.csv" file (rather creation as it does not exist) in write mode.
    # This file will store the scraper information (which avoids re-executing the code each time).
    with open("df_content.csv", "w") as outf:

        # Writing the header line.
        outf.write("Id;MotCle;Abstract\n")

        # For each identifiant
        for id in tqdm(df1['Id']):
            
            
            
            
            
            
            
            
            
            
            
            
            
            
            
            
            
            
            
            
            
            
            
            

            url = "https://core.ac.uk/display/" + str(id) + "?recSetID="
            res = s.get(url)
            soup = BeautifulSoup(res.text)

            # MOTS CLES
            try:
                divs = soup.find('div', {'class': 'article_sum'})
                topic = divs.find('div', {'class': None}).text.replace(
                    "\r", "").replace('\n', '').strip()
                topic = ' '.join(topic.split())[8:]  # enlever topics

                # Mots clés dans la zone "Topics" -> soit séparer par , soit par ;
                if (';' in topic and ',' not in topic):
                    topic = topic.split(';')
                elif (',' in topic and ';' not in topic):
                    topic = topic.split(',')
                elif (';' in topic and ',' in topic):
                    tempo = topic.split(';')
                    topic = []
                    for i in tempo:
                        topic.extend(i.split(','))
                elif ('.' in topic):
                    topic = topic.split('.')
                elif (' - ' in topic):
                    topic = topic.split(' - ')
                elif ('and' in topic):
                    topic = topic.split('and')
                elif ('AND' in topic):
                    topic = topic.split('AND')
                elif ('and' in topic):
                    topic = topic.split('and')
                elif ('and' in topic):
                    topic = topic.split('and')
                else:
                    topic = [topic]

                # En plus, certains mots clés peuvent être dans le résumé (à la fin avec la mention "Keywords: ")
                abstract = soup.find('p',  {'class': 'abstract'})
                if (abstract is not None):
                    abstract = abstract.text.strip().replace("\r", "").replace('\n', '')
                    if ("Keywords" in abstract):
                        index = abstract.rfind("Keywords:") + len("Keywords:")
                        keywords = abstract[index:].split(',')
                        topic.extend(keywords)

            except:
                topic = []
                abstract = soup.find('p',  {'class': 'abstract'})
                if (abstract is not None):
                    abstract = abstract.text.strip().replace("\r", "").replace('\n', '')

                    # A la fin du résumé mais avec mention "keywords:"
                    if ("Keywords" in abstract):
                        index = abstract.rfind("Keywords:") + len("Keywords:")
                        keywords = abstract[index:].split(',')
                        topic.extend(keywords)

                    # A la fin du résumé (liste séparée par , ou ;)
                    else:
                        index = abstract.rfind('.') + 1
                        if (';' in abstract[index:]):
                            abstract_pv = abstract[index:].split(';')
                            topic = [topic for topic in abstract_pv]

                        elif (',' in abstract[index:]):
                            abstract_v = abstract[index:].split(',')

                            # La "," peut être présente dans une phrase normal, ne veut pas dire que c'est forcément une liste...
                            isTopics = True
                            for i in abstract_v:
                                if (len(i) >= 40):
                                    isTopics = False
                            if (isTopics):
                                topic = [topic for topic in abstract_v]

            # RESUMES
            try:
                abstract = soup.find('p',  {'class': 'abstract'}).text.strip().replace(
                    "\r", "").replace('\n', '').replace(';', ',')
            except:
                abstract = ''

            topic = [word for word in topic if len(word) > 3]
            chaine = ""
            for i in topic:
                chaine = chaine + i + "/"

            outf.write(str(id) + ';' + chaine + ';' + abstract + '\n')

100%|██████████| 8801/8801 [2:27:42<00:00,  1.01s/it]   


# Part 3 - Extraction of information from abstracts and titles
In part 1, we have collected the titles of the articles and in part 2, the abstracts. From this information, we will look for the vehicles, fields, technologies, etc. mentioned in the articles.

In [4]:
# To avoid re-executing the code from part 1 (15 min), we read the .csv file into a dataframe.
df2 = pd.read_csv("df_content.csv", sep=';', index_col=False, encoding='utf-8')

# Any duplicates are removed
df2 = df2.drop_duplicates()

# Articles that do not have an abstract are not kept
df2 = df2.dropna(subset=['Abstract'])

# Format
df2.shape

(8439, 3)

After deleting the articles without an abstract, only 8439 articles remain. 

In [5]:
# A join is made between the two dataframes.
# The join concerns the identifiers (only column in common between the two dataframes).
# The 'df' dataframe will contain all the information needed to populate the database. 
df = pd.merge(df1, df2)

# Any duplicates are removed
df = df.drop_duplicates()

# Format
df.shape

(8439, 7)

In the end, we will work on a corpus of 8439 articles. 

In [6]:
# Titles and abstracts are extracted into two variables. 
# To make it easier to find information, we put everything in lower case. 
# This avoids having to differentiate between "Car" and "car". 
title = df['Titre'].apply(lambda x : x if (x is np.nan) else x.lower())
abstract = df['Abstract'].apply(lambda x : x if (x is np.nan) else x.lower())

# Now that the abstracts are temporarily stored in a variable, they can be deleted from
# the dataframe (they will not be stored in the database). 
df = df.drop(columns=['Abstract'])

## Part 3-1) 

In [7]:
def extraction_info1(liste_att_BD, liste_att, title, abstract):
    liste = []

    for titre, resume in zip(tqdm(title), abstract): 
        try :
            liste_abs = [elem_BD for elem, elem_BD in zip(liste_att, liste_att_BD) if re.findall(elem, resume) != []]
        except : 
            liste_abs = []
        try : 
            liste_tit = [elem_BD for elem, elem_BD in zip(liste_att, liste_att_BD) if re.findall(elem, titre) != []]
        except :
            liste_tit = []

        liste_res = liste_abs + liste_tit

        liste_res = list(set(liste_res)) # enlever les doublons 

        res = ""
        for elem in liste_res : 
            res = res + elem + "/"
        
        #df3.iloc[cpt]['Pays'] = res

        liste.append(res)
    
    return liste

In [8]:
liste_pays = ["algeria", "angola", "(benin|dahomey)", "botswana", "burkina", "burundi", "cameroon", "cape verde", "central african republic", "( |,|\.|\-)chad( |,|\.|\-)", "comoros", "( |,|\.|\-)congo( |,|\.|\-)", "djibouti",
              "egypt", "equatorial guinea", "eritrea", "ethiopia", "( |,|\.|\-)gabon( |,|\.|\-)", "gambia", "ghana", "guinea", "guinea(\-| )bissau", "ivory coast", "kenya", "lesotho", "liberia", "libya",
              "madagascar", "malawi", "( |,|\.|\-)mali( |,|\.|\-)", "mauritania", "mauritius", "morocco", "mozambique", "namibia", "( |,|\.|\-)niger( |,|\.|\-)", "nigeria", "rwanda", "sao tome and principe", "senegal",
              "seychelles", "sierra leone", "somalia", "south(\-| )africa", "( |,|\.|\-)sudan( |,|\.|\-)", "swaziland", "tanzania", "( |,|\.|\-)togo( |,|\.|\-)", "tunisia", "uganda", "zambia", "zimbabwe", "albania", "andorra",
              "armenia", "austria", "azerbaijan", "belarus", "belgium", "bosnia", "bulgaria", "croatia", "cyprus", "(czech republic|czechia)", "denmark", "estonia", "finland",
              "france", "england", "scotland", "wales", "georgia", "germany", "greece", "hungary", "iceland", "ireland", "italy", "latvia", "liechtenstein", "lithuania", "luxembourg",
              "macedonia", "( |,|\.|\-)malta( |,|\.|\-)", "moldova", "monaco", "montenegro", "netherlands", "( |,|\.|\-)norway( |,|\.|\-)", "poland", "portugal", "romania", "san marino", "serbia", "slovakia", "slovenia",
              "(espana|spain)", "sweden", "switzerland", "ukraine", "united(\-| )kingdom", "vatican city", "antigua and barbuda", "bahamas", "barbados", "belize", "canada",
              "costa rica", "( |,|\.|\-)cuba( |,|\.|\-)", "dominica( |,|\.|\-)", "dominican republic", "el salvador", "grenada", "guatemala", "haiti", "honduras", "jamaica", "mexico", "nicaragua", "panama",
              "saint kitts and nevis", "saint lucia", "saint vincent and the grenadines", "trinidad and tobago", "(( |,|\.|\-)us( |,|\.|\-)|usa|united(\-| )states)", "argentina", "bolivia", "brazil", "( |,|\.|\-)chile( |,|\.|\-)",
              "colombia", "ecuador", "guyana", "paraguay", "( |,|\.|\-)peru( |,|\.|\-)", "suriname", "uruguay", "venezuela", "afghanistan", "bahrain", "bangladesh", "bhutan", "brunei", "(burma|myanmar)",
              "cambodia", "china", "(east timor|timor(\-| )leste)", "india", "indonesia", "( |,|\.|\-)iran( |,|\.|\-)", "( |,|\.|\-)iraq( |,|\.|\-)", "israel", "japan", "( |,|\.|\-)jordan( |,|\.|\-)", "kazakhstan",
              "(republic of korea|south(\-| )korea|korea, south)", "(north(\-| )korea|korea, north)", "kuwait", "kyrgyzstan", "laos", "lebanon", "malaysia", "maldives", "mongolia",
              "( |,|\.|\-)nepal( |,|\.|\-)", "( |,|\.|\-)oman( |,|\.|\-)", "pakistan", "palestine", "philippines", "qatar", "russia", "saudi arabia", "singapore", "sri lanka", "syria", "tajikistan", "thailand", "turkey",
              "turkmenistan", "united arab emirates", "uzbekistan", "vietnam", "yemen", "australia", "fiji", "kiribati", "marshall islands", "micronesia", "( |,|\.|\-)nauru( |,|\.|\-)", "new(\-| )zealand",
              "palau( |,|\.|\-)", "papua new guinea", "samoa", "solomon islands", "tonga", "tuvalu", "vanuatu", "democratic republic of congo"]

liste_pays_BD = ["Algérie", "Angola", "Bénin", "Botswana", "Burkina Faso", "Burundi", "Cameroun", "Cap vert", "République centrafricaine", "Tchad", "Comores", "Congo", "Djibouti",
                 "Egypte", "Guinée équatoriale", "Erythree", "Ethiopie", "Gabon", "Gambie", "Ghana", "Guinée", "Guinée-Bissau", "Côte d'Ivoire", "Kenya", "Lesotho", "Liberia",
                 "Libye", "Madagascar", "Malawi", "Mali", "Mauritanie", "Ile Maurice", "Maroc", "Mozambique", "Namibie", "Niger", "Nigeria", "Rwanda", "Sao Tomé-et-Principe",
                 "Sénégal", "Seychelles", "Sierra Leone", "Somalie", "Afrique du Sud", "Soudan", "Swaziland", "Tanzanie", "Togo", "Tunisie", "Ouganda", "Zambie", "Zimbabwe",
                 "Albanie", "Andorre", "Arménie", "Autriche", "Azerbaijan", "Biélorussie", "Belgique", "Bosnie", "Bulgarie", "Croatie", "Chypre", "République Tchèque", "Danemark",
                 "Estonie", "Finlande", "France", "Angleterre", "Ecosse", "Pays de Galles", "Géorgie", "Allemagne", "Grèce", "Hongrie", "Islande", "Irelande", "Italie", "Lettonie",
                 "Liechtenstein", "Lituanie", "Luxembourg", "Macédoine", "Malte", "Moldavie", "Monaco", "Montenegro", "Pays-Bas", "Norvège", "Pologne", "Portugal", "Roumanie",
                 "Saint-Marin", "Serbie", "Slovaquie", "Slovénie", "Espagne", "Suède", "Suisse", "Ukraine", "Royaume-Uni", "Vatican", "Antigua-et-Barbuda", "Bahamas", "La Barbade",
                 "Belize", "Canada", "Costa Rica", "Cuba", "Dominique", "République dominicaine", "El Salvador", "Grenade", "Guatemala", "Haiti", "Honduras", "Jamaique", "Mexique",
                 "Nicaragua", "Paname", "Saint-Christophe et Niévès", "Saint Lucie", "Saint Vincent et les Grenadines", "Trinidad et Tobago", "Etats-Unis", "Argentine", "Bolivie",
                 "Brésil", "Chili", "Colombie", "Equateur", "Guyane", "Paraguay", "Pérou", "Suriname", "Uruguay", "Venezuela", "Afghanistan", "Bahrein", "Bangladesh", "Bhoutan",
                 "Brunei", "Myanmar", "Cambodge", "Chine", "Timor oriental", "Inde", "Indonésie", "Iran", "Irak", "Israël", "Japon", "Jordanie", "Kazakhstan", "Corée du Sud",
                 "Corée du Nord", "Koweït", "Kyrgyzstan", "Laos", "Liban", "Malaysie", "Maldives", "Mongolie", "Népal", "Oman", "Pakistan", "Palestine", "Philippines", "Qatar",
                 "Russie", "Arabie Saoudite", "Singapour", "Sri Lanka", "Syrie", "Tajikistan", "Thaïlande", "Turquie", "Turkmenistan", "Emirats Arabes Unis", "Ouzbekistan",
                 "Viêtnam", "Yémen", "Australie", "Fiji", "République de Kiribati", "Iles Marshall", "Micronésie", "Nauru", "Nouvelle-Zélande", "Palau",
                 "Papouasie Nouvelle Guinée", "Iles Samoa", "Iles Salomon", "Tonga", "Tuvalu", "Vanuatu", "République Democratique du Congo"]


res = extraction_info1(liste_pays_BD, liste_pays, title, abstract)
df['Pays'] = res
# stocker dans un df

100%|██████████| 8439/8439 [00:24<00:00, 348.03it/s]


In [9]:
df[['Id','Pays']].sample(5, random_state=97)

Unnamed: 0,Id,Pays
3225,56974258,
7167,220150242,
6487,162044413,Norvège/
1945,4882323,Mexique/
3251,38632234,


In [11]:
cpt = 0 
for i in res:
    if (i != ""):
        cpt += 1
print("Il y a %d articles qui mentionnent des pays" %cpt)

Il y a 1916 articles qui mentionnent des pays


In [12]:
liste_vehicule = [" train", " tractor", " (bike|bicycle)", " (boat|ship|vessel|ferry|narrowboat|barge|liner|yacht)",
                  " (aeroplane| plane|airplane|aircraft| jet)", " (car|automobile|saloon|van|pickup)",
                  " (truck|lorry|semitrailer|rig|juggernaut|hgv|heavy goods vehicle)", " (bus|coach)", " rocket",
                  " helicopter", "(( | motor)bike| motorcycle)", " (tram|streetcar)", "submarine"]

# On veut un espace (fin du mot), une virgule, un s ou un point après nos mots. 
liste_vehicule = [elem + "( |s|,|\.)" for elem in liste_vehicule]
liste_vehicule_BD = ["Train", "Tracteur", "Vélo", "Bateau", "Avion", "Voiture", "Camion", "Bus", "Fusée",
                     "Hélicoptère", "Moto", "Tramway", "Sous-marin"]


res = extraction_info1(liste_vehicule_BD, liste_vehicule, title, abstract)
df['Vehicule'] = res

100%|██████████| 8439/8439 [00:04<00:00, 2051.85it/s]


In [13]:
df[['Id', 'Pays', 'Vehicule']].sample(5, random_state=97)

Unnamed: 0,Id,Pays,Vehicule
3225,56974258,,Voiture/
7167,220150242,,Fusée/
6487,162044413,Norvège/,
1945,4882323,Mexique/,Avion/
3251,38632234,,Bateau/


In [14]:
cpt = 0 
for i in res:
    if (i != ""):
        cpt += 1
print("Il y a %d articles qui mentionnent des véhicules" %cpt)

Il y a 2428 articles qui mentionnent des véhicules


In [15]:
set(res)

{'',
 'Avion/',
 'Avion/Bateau/Train/',
 'Avion/Bateau/Voiture/Train/',
 'Avion/Camion/',
 'Avion/Fusée/Train/',
 'Avion/Hélicoptère/',
 'Avion/Train/',
 'Avion/Train/Camion/',
 'Avion/Voiture/Train/',
 'Bateau/',
 'Bateau/Avion/',
 'Bateau/Avion/Camion/',
 'Bateau/Avion/Sous-marin/',
 'Bateau/Camion/',
 'Bateau/Fusée/',
 'Bateau/Fusée/Avion/',
 'Bateau/Fusée/Camion/',
 'Bateau/Fusée/Hélicoptère/',
 'Bateau/Sous-marin/',
 'Bateau/Tracteur/',
 'Bateau/Train/',
 'Bateau/Train/Camion/',
 'Bateau/Voiture/',
 'Bateau/Voiture/Avion/',
 'Bateau/Voiture/Camion/',
 'Bateau/Voiture/Hélicoptère/',
 'Bateau/Voiture/Train/',
 'Bus/',
 'Bus/Avion/',
 'Bus/Bateau/',
 'Bus/Bateau/Voiture/',
 'Bus/Camion/',
 'Bus/Train/',
 'Bus/Tramway/',
 'Bus/Voiture/',
 'Bus/Voiture/Camion/',
 'Bus/Voiture/Train/',
 'Camion/',
 'Camion/Bus/Bateau/Voiture/Avion/Train/',
 'Camion/Bus/Moto/Bateau/Voiture/Avion/',
 'Fusée/',
 'Fusée/Avion/',
 'Fusée/Avion/Camion/',
 'Fusée/Camion/',
 'Fusée/Hélicoptère/',
 'Fusée/Moto/V

In [16]:
# Marque 
liste_marque = ["hyundai", "toyota", "renault", "honda", "airbus", "boeing", "thales", "mercedes",
                "( |')audi( |'|\.|,)", "( |'|\.|,)kia( |'|\.|,)", "river( |\-|)simple", "nissan", 
                "( |'|\.|,)ford( |'|\.|,)",
                "daimler", "alstom", "bmw", "hopium", "peugeot", "volkswagen", "general motors", "( |'|\.|,)psa( |'|\.|,)", 
                "(roland gumpert|apollo automobil|gmbh)", "mazda", "aston martin", "pininfarina", "suzuki",
                "volvo", "( |'|\.|,)opel( |'|\.|,)", "dassault", "cessna", "bombardier", " mig( |'|\.|,)",
                "diamond aircraft", "zeroavia", "rolls-royce", "( |'|\.|,)eviation( |'|\.|,)", "(gknpz|khrounitchev)", 
                "spacex", "avio( |'|\.|,)", "ariane", "united launch alliance", "mcdonnell douglas", "mitsubishi", 
                "isro", "ioujnoïe", "citroen", "( |'|\.|,)fiat( |'|\.|,)", "( |'|\.|,)lancia( |'|\.|,)", "skoda",
                "yamaha", "( |'|\.|,)ktm( |'|\.|,)", "kawasaki", "ducati", "suzuki"]
    
liste_marque_BD = ["Hyundai", "Toyota", "Renault", "Honda", "Airbus", "Boeing", "Thalès", "Mercedes","Audi", "Kia",
                   "Riversimple", "Nissan", "Ford", "Daimler", "Alstom", "BMW", "Hopium", "Peugeot", "Volkswagen",
                   "General Motors", "PSA", "Roland Gumpert", "Mazda", "Aston Martin", "Pininfarina", "Suzuki",
                   "Volvo", "Opel", "Dassault", "Cessna", "Bombardier", "MiG", "Diamond Aircraft", "ZeroAvia",
                   "Rolls-Royce", "Eviation", "Khrounitchev", "SpaceX", "Avio", "ArianeGroup", 
                   "United Launch Alliance", "McDonnell Douglas", "Mitsubishi Heavy Industries", "ISRO", 
                   "Ioujnoie", "Citroën", "Fiat", "Lancia", "Skoda", "Yamaha", "KTM", "Kawasaki", "Ducati", 
                   "Suzuki"]

res = extraction_info1(liste_marque_BD, liste_marque, title, abstract)
df['Marque'] = res

100%|██████████| 8439/8439 [00:08<00:00, 987.89it/s] 


In [17]:
df[['Id', 'Pays', 'Vehicule', 'Marque']].sample(5, random_state=97)

Unnamed: 0,Id,Pays,Vehicule,Marque
3225,56974258,,Voiture/,
7167,220150242,,Fusée/,
6487,162044413,Norvège/,,
1945,4882323,Mexique/,Avion/,
3251,38632234,,Bateau/,


In [18]:
cpt = 0 
for i in res:
    if (i != ""):
        cpt += 1
print("Il y a %d articles qui mentionnent des marques" %cpt)

Il y a 290 articles qui mentionnent des marques


In [19]:
set(res)

{'',
 'Airbus/',
 'Alstom/',
 'ArianeGroup/',
 'Audi/',
 'BMW/',
 'Boeing/',
 'Bombardier/',
 'Cessna/',
 'Cessna/Ducati/',
 'Daimler/',
 'Ducati/',
 'Fiat/',
 'Fiat/Audi/',
 'Ford/',
 'Ford/Aston Martin/BMW/',
 'Ford/Ducati/',
 'Ford/General Motors/',
 'Ford/General Motors/Daimler/',
 'General Motors/',
 'General Motors/Daimler/',
 'General Motors/Ducati/',
 'Honda/',
 'Hyundai/',
 'ISRO/',
 'Kawasaki/',
 'Kawasaki/Roland Gumpert/',
 'Mazda/',
 'McDonnell Douglas/',
 'Mercedes/',
 'Mercedes/Daimler/',
 'Mercedes/Fiat/',
 'Mercedes/Toyota/BMW/Audi/',
 'MiG/',
 'MiG/Airbus/',
 'Nissan/',
 'Nissan/BMW/Volkswagen/Toyota/Mercedes/Ford/Honda/',
 'Nissan/Toyota/Renault/',
 'PSA/',
 'PSA/Nissan/BMW/Toyota/Renault/',
 'Peugeot/PSA/',
 'Renault/',
 'Roland Gumpert/',
 'Roland Gumpert/BMW/',
 'Roland Gumpert/Ducati/',
 'Rolls-Royce/Toyota/BMW/',
 'SpaceX/',
 'Suzuki/',
 'Toyota/',
 'Toyota/General Motors/Honda/',
 'Toyota/Honda/',
 'Toyota/Honda/Daimler/',
 'Volvo/',
 'Volvo/Daimler/',
 'Yamaha/

### Pls éléments

In [20]:
def extraction_info2(liste_att_BD, liste_att, title, abstract):
    liste = []

    for titre, resume in zip(tqdm(title), abstract): 

        liste_tempo = []
        for att_BD, liste_voc in zip(liste_att_BD, liste_att):
            try :
                liste_abs = [voc for voc in liste_voc if re.findall(voc, resume) != []]
            except : 
                liste_abs = []
            try : 
                liste_tit = [voc for voc in liste_voc if re.findall(voc, titre) != []]
            except :
                liste_tit = []

            liste_res = liste_abs + liste_tit
    
            #tempo = [voc for voc in liste_voc if re.findall(voc, abstract) != []]
            #tempo2 = [voc for voc in liste_voc if re.findall(voc, title) != []]
            
            if (len(liste_res) > 0):
                liste_tempo.append(att_BD)

        res = ""
        for elem in liste_tempo : 
            res = res + elem + "/"

        liste.append(res)
    
    return liste

In [21]:
# Moteurs
combustion_int = ["internal( |\-)combustion engine", "i.c.e", "burn(s| ) * fuel * cylinder(s| )",
                  "reciprocating engine", "piston engine" ]

reaction = ["jet (engine|propulsion)", "turbojet", "propellant ", "projection * fluid", "turbofan",
            "ramjet", "pulse jet"]

pile_combustible = ["fuel cell", "electrochemical cell", "oxidi(z|s)ation", "oxidizing agent",
                    "redox", "chemical energy", "anode", "cathode", "electrolyte"]

hybride = ["hybrid", "hydrogen addition on"]

liste_moteur = [combustion_int, reaction, pile_combustible, hybride]
liste_moteur_BD = ["Combustion interne", "Réaction",  "Pile à combustible", "Hybride"]

res = extraction_info2(liste_moteur_BD, liste_moteur, title, abstract)
df['Moteur'] = res

100%|██████████| 8439/8439 [00:01<00:00, 4486.86it/s]


In [22]:
cpt = 0 
for i in res:
    if (i != ""):
        cpt += 1
print("Il y a %d articles qui mentionnent des moteurs" %cpt)

Il y a 4920 articles qui mentionnent des moteurs


In [23]:
pd.Series(res).value_counts()

                                                           3519
Combustion interne/                                        1992
Pile à combustible/                                        1046
Combustion interne/Pile à combustible/                      941
Combustion interne/Pile à combustible/Hybride/              262
Pile à combustible/Hybride/                                 226
Hybride/                                                    168
Combustion interne/Hybride/                                 140
Réaction/                                                    92
Combustion interne/Réaction/                                 40
Combustion interne/Réaction/Pile à combustible/Hybride/       4
Réaction/Hybride/                                             3
Réaction/Pile à combustible/                                  2
Combustion interne/Réaction/Hybride/                          2
Réaction/Pile à combustible/Hybride/                          1
Combustion interne/Réaction/Pile à combu

In [24]:
# Types de stockages 

gaz = ["compress", "gaseous", "increase * pressure", "high( |-)pressure", "ch2", "cgh2",
       "under( |-)pressure", "storage density", "hydrogen tanks"]

liquide = ["liqu(e|i)f", "liquid (hydrogen|state|form)", "LH2", "cooled ", "critical point", "33( |)k",
           "-253(| )°c", "252(,|.)87", "20(.|,)28( |)k", "parahydrogen", "orthohydrogen", "low temperature",
           "cooling", "chilled"]

solide = ["solid (hydrogen|state|form)", "a(b|d)sorption", "metal alloy", "hydride", "alanate"]

liste_stockage = [gaz, liquide, solide]
liste_stockage_BD = ['Gaz', 'Liquide', 'Solide']

res = extraction_info2(liste_stockage_BD, liste_stockage, title, abstract)
df['Stockage'] = res

100%|██████████| 8439/8439 [00:01<00:00, 5066.70it/s]


In [25]:
cpt = 0 
for i in res:
    if (i != ""):
        cpt += 1
print("Il y a %d articles qui mentionnent des types de stockages" %cpt)

Il y a 2443 articles qui mentionnent des types de stockages


In [26]:
pd.Series(res).value_counts()

                       5996
Solide/                 733
Gaz/                    688
Liquide/                559
Gaz/Liquide/            209
Gaz/Solide/             121
Liquide/Solide/          95
Gaz/Liquide/Solide/      38
dtype: int64

In [27]:
# Types de production 

# 3 types : À partir de composés organiques, à partir de l'eau, à partir phtosynthèse 

# (1) Hydraucarbues
SMR = ["smr", "steam (methane |)reforming", "nickel catalyst", "endothermic reaction", "iron oxide", 
       "reaction * (hydrocarbons|fuels) * water" ]

methane_pyrolysis = ["pyrolysis", "bubble column", "molten metal catalyst", "solid carbon"]

partial_oxidation = ["partial oxidation","( |\()pox( |\)|,|\.)", "substoichiometric", 
                     "(fuel\-air|fuel\-oxygen) mixture", "water-gas shift reaction",
                    "partial. combusted", "partial oxidation reactor"]

plasma_reforming = ["plasma reforming", "kvaerner", "cb&h", "carbon black", "plasma arc waste disposal",
                    "plasma gasification", "plasma converter"]

coal_gasification = ["coal gasification", "break molecular bonds in coal", "coal * water * (air|oxygen)",
                     "gaseous mix of hydrogen and carbon monoxide"]

# (2) Eau
water_electrolysis = ["electrolysis", "(split|decompose) water", "water (splitting|decomposition)", 
                      "potential difference", "electrolyser", "electrolyte", "polymer", "amkaline" ]

electrochemically = ["electrochemically", "(methanol|ethanol|formic acid[glycerol) * electrolys", 
                     "sulfur-iodine cycle", "s-i cycle", "sulfur * iodine", ]

radiolysis = ["radiolysis", "radiolytically"]

thermolysis = ["thermolysis"]

thermochemical = ["thermochemical", "heat sources * chemical reactions"]

ferrosilicon = ["ferrosilicon"]

algae = ["photobiological (water|) splitting", "algae", "photobioreactor", "photosynthesis"]

photocatalytic = ["photocatalytic" , "solar energy to hydrogen", "photoelectrochemical cell", 
                  "artificial photosynthesis" ]

# (3) Biohydrgroène
fermentative = ["bioreactor", "bacteria", "fermentative", "organic (substrate|compound)", "biohydrogen",
                "fermentation"]

enzymatic = ["enzyma", "sugars"]

biocatalysed_electrolysis = ["electrohydrogenesis", "microbial fuel cell", "biocatalysed electrolysis", 
                             "electrohydrogenesis", "electrolysis * microbes" ]



liste_type_pro = [SMR, methane_pyrolysis, partial_oxidation, plasma_reforming, coal_gasification, 
                  water_electrolysis, electrochemically, radiolysis, thermolysis, thermochemical, 
                  ferrosilicon, algae, photocatalytic]
liste_type_prod_BD = ["SMR", "Pyrolyse du méthane", "Oxydation partielle", "Reformage plasma",
                      "Gazéification du charbon", "Electrolyse", "Electrochimie", "Radiolyse", "Thermolyse",
                      "Thermochimie" , "Ferrosilicium", "Culture d'algues", "Fission photocatalytique",
                      "Fermentation", "Production enzymatique", "Electrolyse biocatalysée"]

    
res = extraction_info2(liste_type_prod_BD, liste_type_pro, title, abstract)
df['Production'] = res

100%|██████████| 8439/8439 [00:05<00:00, 1418.41it/s]


In [28]:
cpt = 0 
for i in res:
    if (i != ""):
        cpt += 1
print("Il y a %d articles qui mentionnent des types de productions" %cpt)

Il y a 1848 articles qui mentionnent des types de productions


In [29]:
pd.Series(res).value_counts()

                                                      6591
Electrolyse/                                          1083
SMR/                                                   118
Culture d'algues/                                       86
SMR/Electrolyse/                                        72
                                                      ... 
SMR/Culture d'algues/                                    1
SMR/Pyrolyse du méthane/Electrolyse/                     1
Electrolyse/Thermochimie/Fission photocatalytique/       1
Pyrolyse du méthane/Gazéification du charbon/            1
Pyrolyse du méthane/Electrolyse/Thermochimie/            1
Length: 63, dtype: int64

In [30]:
# Domaines
militaire = ["militar", "army", "soldie", "warfare", "armed forces",
             "navy", "air force", "nato", "defence", "regiment", "fighting"]  # NATO

agriculture = ["agricultur", "cultivat", "agronomy", "harvest", "crops", "tractor", "sowing", "seeding",
               "gmo", "harvest", "plough", "plow", "irrigation", "insecticid", "herbicide",
               "weedkiller", "seed", "farm", "fertilizer", "fertiliser", "cultur"]

astronautique = [" space", "astronautics", "orbit", "atmosphere", "space shuttle", "rocket", "landing",
                 "aerodynamic", "propulsion", "thrust", "cosmonaut", "jet engine", "thales", "boeing", 
                 "take-off ", "takeoff ", "(gknpz|khrounitchev)", "( |'|\.|,)nasa( |'|\.|,)", "spacex", 
                 "avio( |'|\.|,)", "ariane", "united launch alliance", "mcdonnell douglas",
                 "isro", "ioujnoïe"]

aéronautique = ["(aeroplane| plane|airplane)", "aircraft", "aeronautic", "aviation", "helicopter", 
                "aerodynamic", "boeing", "airbus", "(transonic|(in|)compressible) flow",
                "take-off ", "takeoff ", "thales", "dassault", "cessna", "bombardier"]

transport_march = ["charter", " (truck|lorry|semitrailer|rig|juggernaut|hgv|heavy goods vehicle)( |s|,|\.)",
                  " trailer", " container", " freight", " cargo", " shipment", " merchant ship", "service road",
                  " logistics", " handling", " hefting", " merchandise", " goods", " transporter", " haulier",
                  "long-haul * transportation", "haulage contractor", " carr(ying|ier)", "wagon"]

transport_pers = ["(public|mass|urban|school) (transport|transit)", " passenger", " occupant", " traverl(l|)er",
                         " voyager", "carrriage", " (bike|bicycle|yacht|ferry|jet|tram|streetcar|( | motor)bike| motorcycle|\
                         car|automobile|saloon|van|pickup|bus|coach|underground|subwat|metro)( |s|,|\.)", ]
    
politique = [" decree( |s|,|\.)", " law( |s|,|\.)", " legislation", "carbon tax", "paris agreement", "unfccc", " cop", 
            " protocol", "cmp [0-9]", "cop [0-9]", "climate change conference", "subsidy"]

automobile = ["automobile", " (saloon|van|pickup|car)( |s|,|\.)", " motoring( |s|,|\.)", "tailpipe", 
              "exhaust pipe", "horsepower", "dealership", "motor rac(ing|e)", "hyundai", "toyota", 
              "renault", "honda", "mercedes", "( |')audi( |'|\.|,)", "( |'|\.|,)kia( |'|\.|,)", 
              "river( |\-|)simple", "nissan", "( |'|\.|,)ford( |'|\.|,)", 
              "daimler", "bmw", "hopium", "peugeot", "volkswagen", "general motors", "( |'|\.|,)psa( |'|\.|,)",  "(roland gumpert|apollo automobil|gmbh)", 
              "mazda", "aston martin", "pininfarina", "suzuki", "volvo", "( |'|\.|,)opel( |'|\.|,)", "rolls-royce", "citroen", "( |'|\.|,)fiat( |'|\.|,)", 
              "( |'|\.|,)lancia( |'|\.|,)", "skoda", "yamaha", "( |'|\.|,)ktm( |'|\.|,)", "kawasaki", "ducati", "suzuki"]

ferroviaire = [" rail", " train( |s|,|\.)", "alstom", "locomotive", "interurbain", "wagon", "freight car", "carrriage", ]

energies_renouv = ["solar (panel|energy)", "wind (power|energy)", "hydropower", "biofuel" ,"renewable (energy|resource)", "biomass", "firewood", "tidal energy"]

energies_fossiles = ["fossil (energy|fuel)", " gas ", " oil ", "petroleum", " coal ", " fuels ", " hydrocarbons"]

performance = [" horsepower", " record ", " potency ", " optimum ", " maximum ", " performance ", "battery life", 
               "yield", "profitability", "cost-effectiveness", "financial viability", "investment"]

chimie = [" chemi*", " substances", " elementary form", " matter ", " atomic ", "microscop*", "catalyst", 
          "colloid", "isomer", "reactant", "satureted", "solubility", "solvent", "(hetero|homo)geneous mixture"]


liste_domaine = [militaire, agriculture, astronautique, aéronautique, transport_march, transport_pers, 
                 politique, automobile, ferroviaire, energies_renouv, energies_fossiles, performance, chimie]

liste_domaine_BD = ['Militaire', 'Agriculture', 'Astronautique', "Aéronautique", "Transport Marchandises", 
                   "Transport Personnes", "Politique", "Automobile", "Ferroviaire", "Energies Renouvelables", 
                    "Energies Fossiles", "Performance", "Chimie"]


res = extraction_info2(liste_domaine_BD, liste_domaine, title, abstract)
df['Domaine'] = res

100%|██████████| 8439/8439 [00:23<00:00, 360.14it/s]


In [31]:
cpt = 0 
for i in res:
    if (i != ""):
        cpt += 1
print("Il y a %d articles qui mentionnent des domaines" %cpt)

Il y a 7918 articles qui mentionnent des domaines


In [32]:
for cat in liste_domaine_BD:
    cpt = 0
    for i,j in zip(pd.Series(res).value_counts().index, pd.Series(res).value_counts()):
        if (cat in i):
               cpt += j
    print(cat, cpt)

Militaire 192
Agriculture 619
Astronautique 1705
Aéronautique 1869
Transport Marchandises 1177
Transport Personnes 1264
Politique 707
Automobile 1338
Ferroviaire 212
Energies Renouvelables 1761
Energies Fossiles 3620
Performance 2380
Chimie 3360


In [33]:
df.head()

Unnamed: 0,Id,Date,Auteur,Titre,Langue,MotCle,Pays,Vehicule,Marque,Moteur,Stockage,Production,Domaine
0,6334032,,Hezlin Ashraf-Ball and Andrew J. Oswald and Ja...,Hydrogen Transport and the Spatial Requirement...,English,Renewable energy / wind power / land use / ene...,,,,,Gaz/,,Energies Renouvelables/Energies Fossiles/
1,48223,2009-05-01,Andrew J. Oswald and James I. Oswald and Hezli...,Hydrogen transport and the spatial requirement...,English,,,,,,Gaz/,,Energies Renouvelables/Energies Fossiles/
2,52955835,2011-01-01,PeiYuan Hsu and Xu Yang and Joshua L. Dibia an...,Design of Residential Hydrogen Fueling System ...,,Faculty research day/,,Voiture/,,,Gaz/,,Automobile/Energies Renouvelables/Energies Fos...
3,70657172,2008-09-24,Maria Antónia Travassos and A. I. Correia de S...,Penetration of hydrogen technologies: study on...,,Road transport-Portugal/ Pollutant emissions/ ...,Portugal/,Voiture/,,,,,Transport Personnes/Politique/Automobile/
4,40069044,2007-01-01,Ivo Veldhuis,Application of hydrogen marine systems in high...,,130 - Mechanical/ industrial/ civil and marine...,,Bateau/,,,Liquide/,,Aéronautique/Transport Marchandises/Transport ...


In [34]:
df.to_csv("df_scraping_final.csv", index=False)

In [None]:
# Modif schéma 
# rajouter marque ?  

In [None]:
# Pb : 
# auteur => différencier nom prénom 


# Si pls véhicules et plus types de moteur => comment les liés ? (idem avec technologie)
# Solutions : 
# - soit séparer véhicule/tehchnologie/moteur => pour lié technologie au véhicule (ou véhicule / moteur) => faire jointure avec article où il n'y a qu'un seul type de véhicule !!!!!!!!

In [None]:
# for pays, pays_BD in zip(liste_pays, liste_pays_BD):
#     print(pays)
#     re.findall(pays, "jkfefg")