The list of chemical compounds used here, 1976 rows in total, was scraped from https://en.wikipedia.org/wiki/List_of_CAS_numbers_by_chemical_compound. It is a list of CAS numbers by chemical formula and chemical compound, indexed by formula. The CAS number is a unique number assigned to a specific chemical by the Chemical Abstracts Service (CAS).

In [None]:
!pip install requests pandas beautifulsoup4  # The code below needs the requests, pandas and beautifulsoup4 libraries.




In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the Wikipedia page
url = "https://en.wikipedia.org/wiki/List_of_CAS_numbers_by_chemical_compound"

# Send an HTTP request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all tables on the page
    tables = soup.find_all('table')

    # Initialize an empty list to store DataFrames
    all_dfs = []

    # Iterate through each table
    for i, table in enumerate(tables):
        # Extract data from the table
        data = []
        for row in table.find_all('tr')[1:]:  # Skip the header row
            columns = row.find_all(['th', 'td'])
            data.append([col.get_text(strip=True) for col in columns])

        # Create a DataFrame from the extracted data
        df = pd.DataFrame(data, columns=["Chemical formula", "Chemical Name", "CAS number"])

        # Append the DataFrame to the list
        all_dfs.append(df)

        # Print the DataFrame for each table
        print(f"Table {i + 1}:\n{df}\n{'=' * 50}\n")

    # Concatenate all DataFrames into one
    final_df = pd.concat(all_dfs, ignore_index=True)

    # Print the final concatenated DataFrame
    print("Final Concatenated DataFrame:\n", final_df)

else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")


Table 1:
    Chemical formula                Chemical Name  CAS number
0              Ac2O3          actinium(III) oxide  12002-61-8
1                 Ag                       silver   7440-22-4
2            AgAlCl4  silver tetrachloroaluminate  27039-77-6
3               AgBr               silver bromide   7785-23-1
4             AgBrO3               silver bromate   7783-89-3
..               ...                          ...         ...
114       Au2(SeO4)3           gold(III) selenate  10294-32-3
115           Au2Se3           gold(III) selenide   1303-62-4
116           Au4Cl8         gold(I,III) chloride  62792-24-9
117            Au4F8         gold(I,III) fluoride  14270-21-4
118                                          None        None

[119 rows x 3 columns]

Table 2:
    Chemical formula          Chemical Name  CAS number
0                BAs         boron arsenide  12005-69-5
1              BAsO4         boron arsenate  13510-31-1
2               BBr3       boron tribromide  

In [None]:
final_df.head()  # preview of the table scraped from Wikipedia

Unnamed: 0,Chemical formula,Chemical Name,CAS number
0,Ac2O3,actinium(III) oxide,12002-61-8
1,Ag,silver,7440-22-4
2,AgAlCl4,silver tetrachloroaluminate,27039-77-6
3,AgBr,silver bromide,7785-23-1
4,AgBrO3,silver bromate,7783-89-3


In [None]:
final_df.info()  # summary of the scraped table

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1976 entries, 0 to 1975
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Chemical formula  1976 non-null   object
 1   Chemical Name     1975 non-null   object
 2   CAS number        1975 non-null   object
dtypes: object(3)
memory usage: 46.4+ KB
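
The summary above shows one row without a chemical name or CAS number, which matches the empty row at the end of Table 1. A minimal sketch of how such rows could be dropped before saving is given below. The helper name `drop_incomplete_rows` and the small sample table are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

def drop_incomplete_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows in which any cell is missing or blank."""
    # A cell counts as blank if it is None/NaN or an empty (whitespace-only) string.
    is_blank = df.apply(lambda col: col.isna() | (col.astype(str).str.strip() == ""))
    return df[~is_blank.any(axis=1)].reset_index(drop=True)

# Small stand-in for the scraped table, including one empty row like row 118 above.
sample = pd.DataFrame(
    [
        ["Ag", "silver", "7440-22-4"],
        ["", None, None],
        ["AgBr", "silver bromide", "7785-23-1"],
    ],
    columns=["Chemical formula", "Chemical Name", "CAS number"],
)

cleaned = drop_incomplete_rows(sample)
print(len(sample), "->", len(cleaned))
```

Applied to `final_df`, the same helper would remove the incomplete row while leaving the complete ones untouched.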


In [None]:
final_df.to_csv('/content/final_df.csv', index=False)  # save the table as CSV; the file is included in the folder

In [None]:
chemicals = pd.read_csv('/content/df_TIM1_FINAL.csv')
chemicals.head()  # Task 1, first approach: the Kungul database as we handled it

Unnamed: 0,IngredientIdentifier,Name,Carcinogens,EndocrineDisruptors,Allergen,SkinIrritant,Synonyms
0,G00001,saccharomycesleuconostocapple fruitcarrot root...,False,False,False,False,
1,G00002,lactobacilluscentella asiaticagleditsia sinens...,False,False,False,False,
2,G00003,bacilluscordyceps sinensisganoderma lucidumino...,False,False,False,False,
3,G00004,ziziphus spinachristi leaf,False,False,False,False,
4,G00005,zingiber officinale water,False,False,False,False,


In [None]:
chemicals_synonyms = chemicals.copy()

In [None]:
import pandas as pd
import gensim
import gensim.downloader as api
from gensim.models import KeyedVectors
from tqdm import tqdm  # For progress bar

PubChemPy is a Python library that provides a programmatic interface to access chemical information from the PubChem database. PubChem is a free chemistry database maintained by the National Center for Biotechnology Information (NCBI), which is part of the National Institutes of Health (NIH).

With PubChemPy, you can retrieve information about chemical compounds, search for compounds based on various criteria, and access data such as chemical properties, identifiers, and biological activities.

In [None]:
!pip install pubchempy

Collecting pubchempy
  Downloading PubChemPy-1.0.4.tar.gz (29 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pubchempy
  Building wheel for pubchempy (setup.py) ... done
  Created wheel for pubchempy: filename=PubChemPy-1.0.4-py3-none-any.whl size=13819 sha256=765c75134c9c36fe33bb759f651f104b39971b613178361f798b2381950b229f
  Stored in directory: /root/.cache/pip/wheels/90/7c/45/18a0671e3c3316966ef7ed9ad2b3f3300a7e41d3421a44e799
Successfully built pubchempy
Installing collected packages: pubchempy
Successfully installed pubchempy-1.0.4


This is an example of how we can access the data for a given chemical compound.

In [None]:
import pubchempy as pcp

def get_synonyms(compound_name):
    # Search for the compound by name
    compound = pcp.get_compounds(compound_name, 'name', record_type='3d')[0]

    # Get synonyms from the compound data
    synonyms = compound.synonyms

    return synonyms

# Example usage
compound_name = "aspirin"
synonyms = get_synonyms(compound_name)

# Print the synonyms
print(f"Synonyms for {compound_name}: {synonyms}")


Synonyms for aspirin: ['aspirin', 'ACETYLSALICYLIC ACID', '50-78-2', '2-Acetoxybenzoic acid', '2-(Acetyloxy)benzoic acid', 'Acetylsalicylate', 'O-Acetylsalicylic acid', 'o-Acetoxybenzoic acid', 'Acenterine', 'Acetophen', 'Acetosal', 'Acylpyrin', 'Easprin', 'Ecotrin', 'Salicylic acid acetate', 'Acetosalin', 'Aspirdrops', 'Polopiryna', 'Salcetogen', 'Aceticyl', 'Acetonyl', 'Acetylin', 'Acidum acetylsalicylicum', 'Benaspir', 'Colfarit', 'Empirin', 'Endydol', 'Measurin', 'Rhodine', 'Saletin', 'o-Carboxyphenyl acetate', 'Enterosarein', 'Enterosarine', 'Acetisal', 'Acetylsal', 'Aspirine', 'Bialpirinia', 'Entericin', 'Enterophen', 'Micristin', 'Pharmacin', 'Premaspin', 'Salacetin', 'Solpyron', 'Temperal', 'Acesal', 'Acisal', 'Asagran', 'Asteric', 'Duramax', 'Ecolen', 'Extren', 'Globoid', 'Helicon', 'Idragin', 'Rhonal', 'Aspro', 'Novid', 'Rheumintabletten', 'Yasta', 'Solprin acid', 'Benzoic acid, 2-(acetyloxy)-', 'Acimetten', 'Bialpirina', 'Claradin', 'Clariprin', 'Delgesic', 'Entrophen', 'Glo

We apply the same logic here to retrieve synonyms for every chemical compound name in our previously updated database.

In [None]:
import pandas as pd
import pubchempy as pcp

def get_synonyms(chemical_name):
    try:
        # Search for the compound by name
        compound = pcp.get_compounds(chemical_name, 'name', record_type='3d')

        if compound:
            # If compound is found, return synonyms
            return compound[0].synonyms
        else:
            # If compound is not found, return an empty list
            return []
    except pcp.PubChemHTTPError as e:
        # Handle PubChemHTTPError when there is an issue with the PubChem server
        print(f"PubChemHTTPError: {e}")
        return []
    except Exception as e:
        # Handle other exceptions
        print(f"Error: {e}")
        return []

# Create an empty list to store synonyms
all_synonyms = []

# Iterate through each chemical name in the 'Name' column
for chemical_name in chemicals['Name']:
    # Get synonyms for the current chemical name
    synonyms = get_synonyms(chemical_name)

    # Append synonyms to the list
    all_synonyms.append(synonyms)

# Print or use the list of synonyms as needed
print(all_synonyms)


Error: 'float' object is not iterable


IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
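
The `'float' object is not iterable` message is most likely caused by a missing value in the `Name` column, which pandas stores as a float `NaN`. The IOPub warning comes from printing the whole list at once. A minimal sketch of a loop that skips non-string names and prints only a summary follows. The names `collect_synonyms` and `fake_lookup` are hypothetical; `fake_lookup` stands in for `get_synonyms` so that the sketch runs without network access.

```python
import math

def collect_synonyms(names, lookup):
    """Call lookup(name) for each usable name; record [] for missing or blank names."""
    results = []
    for name in names:
        # pandas stores a missing Name as float NaN, which is not a string.
        if not isinstance(name, str) or not name.strip():
            results.append([])
            continue
        results.append(lookup(name))
    return results

# Hypothetical stand-in for get_synonyms, so the sketch runs without network access.
def fake_lookup(name):
    return {"aspirin": ["aspirin", "ACETYLSALICYLIC ACID", "50-78-2"]}.get(name, [])

names = ["aspirin", math.nan, "zingiber officinale water"]
all_synonyms = collect_synonyms(names, fake_lookup)

# Print a short summary rather than the full list of lists.
found = sum(1 for synonyms in all_synonyms if synonyms)
print(f"{found} of {len(all_synonyms)} names returned synonyms")
```

In the notebook, `lookup` would be the `get_synonyms` function defined above and `names` would be `chemicals['Name']`.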



all_synonyms_df is the dataframe built from the output of the previous code, which queries the PubChem database; it contains synonyms for approximately 1300 chemical compounds.

In [None]:
all_synonyms_df = pd.DataFrame({'Synonyms': all_synonyms})  # build a dataframe from the output

# Concatenate along columns (axis=1)
final = pd.concat([chemicals_synonyms, all_synonyms_df], axis=1)  # concatenate with our previous database

This is what the dataframe of ingredient synonyms that we obtained through web scraping looks like.

In [None]:
all_synonyms_df

Unnamed: 0,Synonyms
0,[]
1,[]
2,[]
3,[]
4,[]
...,...
14716,[]
14717,[]
14718,"[1-propanol, propanol, Propan-1-ol, Propyl alc..."
14719,[]


In [None]:
import pandas as pd

# Create a DataFrame
all_synonyms_df = pd.DataFrame({'Synonyms': all_synonyms})

# Save to CSV
all_synonyms_df.to_csv('/content/all_synonyms.csv', index=False)  # save the table as CSV; the file is included in the folder


In [None]:
all_synonyms_df = pd.DataFrame({'Synonyms': all_synonyms})

# Concatenate along columns (axis=1)
final = pd.concat([chemicals_synonyms, all_synonyms_df], axis=1)  # concatenate with our previous database


In [None]:
final.head()  # There are two Synonyms columns: the first was obtained with the Word2Vec algorithm in the previous approach to handling the database;
# we can drop one of them if we like.

Unnamed: 0,IngredientIdentifier,Name,Carcinogens,EndocrineDisruptors,Allergen,SkinIrritant,Synonyms,Synonyms.1
0,G00001,saccharomycesleuconostocapple fruitcarrot root...,False,False,False,False,,[]
1,G00002,lactobacilluscentella asiaticagleditsia sinens...,False,False,False,False,,[]
2,G00003,bacilluscordyceps sinensisganoderma lucidumino...,False,False,False,False,,[]
3,G00004,ziziphus spinachristi leaf,False,False,False,False,,[]
4,G00005,zingiber officinale water,False,False,False,False,,[]
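
Two columns with the same name are awkward to select or drop individually. One option, sketched below on toy data, is to rename the PubChem column before concatenating. The column name `Synonyms_pubchem` is a hypothetical choice for this sketch, and the two small frames only mimic the shape of `chemicals_synonyms` and `all_synonyms_df`.

```python
import pandas as pd

# Toy frames with the same column layout as in the notebook.
chemicals_synonyms = pd.DataFrame(
    {"Name": ["zingiber officinale water", "1-propanol"], "Synonyms": [None, None]}
)
all_synonyms_df = pd.DataFrame({"Synonyms": [[], ["1-propanol", "propanol"]]})

# "Synonyms_pubchem" is a hypothetical column name chosen for this sketch.
final = pd.concat(
    [chemicals_synonyms, all_synonyms_df.rename(columns={"Synonyms": "Synonyms_pubchem"})],
    axis=1,
)
print(final.columns.tolist())
```

With unique column names, either synonym column can later be dropped with `final.drop(columns=[...])` without affecting the other.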


In [None]:
final.to_csv('/content/final.csv', index=False)  # save the table as CSV; the file is included in the folder

In [None]:
final['Synonyms'].count()  # non-null counts for the two Synonyms columns

Synonyms      502
Synonyms    14721
dtype: int64

In [None]:
final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14721 entries, 0 to 14720
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   IngredientIdentifier  14721 non-null  object
 1   Name                  14720 non-null  object
 2   Carcinogens           14721 non-null  bool  
 3   EndocrineDisruptors   14721 non-null  bool  
 4   Allergen              14721 non-null  bool  
 5   SkinIrritant          14721 non-null  bool  
 6   Synonyms              502 non-null    object
 7   Synonyms              14721 non-null  object
dtypes: bool(4), object(4)
memory usage: 517.7+ KB


## We were not able to link the CAS numbers to our database: the scraped list covers different compounds, and the names differ from ours. Web scraping is not always permitted, and some databases are protected. We used web scraping to match synonyms to the chemical names we already have and had previously updated. As a further improvement, it remains to complete the database and find a way to link a CAS number to each compound. Regarding Task 2, we tried several libraries and algorithms for word matching and kept the one that yielded the best results. Training the model and checking its accuracy take days. Given the time we had, this is what we can offer as a conclusion.
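
One possible route to the CAS link, which was not used in this notebook, is that PubChem synonym lists can themselves contain a CAS number; the aspirin output above includes `50-78-2`. The sketch below picks CAS-formatted strings out of a synonym list and checks them against the CAS check digit. The function names `is_valid_cas` and `extract_cas_numbers` are hypothetical, and whether a synonym list contains a CAS number at all depends on the compound.

```python
import re

# CAS format: 2-7 digits, 2 digits, 1 check digit, separated by hyphens.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")

def is_valid_cas(candidate: str) -> bool:
    """Return True if the string has CAS format and a correct check digit."""
    match = CAS_PATTERN.match(candidate.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == check_digit

def extract_cas_numbers(synonyms):
    """Keep only the synonyms that look like valid CAS numbers."""
    return [s for s in synonyms if isinstance(s, str) and is_valid_cas(s)]

synonyms = ["aspirin", "ACETYLSALICYLIC ACID", "50-78-2", "2-Acetoxybenzoic acid"]
print(extract_cas_numbers(synonyms))
```

Applied row by row to the PubChem synonyms column, this would give a candidate CAS number for each compound that PubChem recognised, which could then be compared with the Wikipedia table.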