Datasets with programming languages info

The goal of this repository is to mine information and build datasets about programming languages.

The dataset currently contains more than 600 languages, and it includes each language's website, creation date, paradigms, and type system.

In addition, my goal is to include information about the popularity trends of each language, so feel free to send suggestions on how to do it, or implement it yourself and send a pull request.

Using the dataset

The following code queries the newest programming languages (it assumes you are running from a clone of this repository; see the next section for loading the dataset remotely):

>>> from datasets import languages
>>> languages.sort_values('first_release', ascending=False, inplace=True)
>>> languages[['name', 'first_release']].head()

               name  first_release
494  project verona           2019
65           bosque           2019
582          source           2017
507              q#           2017
51        ballerina           2017

If you want to see more usage examples, see this notebook on GitHub, or open it in Google Colab.

How to use the dataset

The dataset is stored as a tab-separated file inside the datasets directory, so you only need to paste the raw link of the file:

import pandas as pd

# Raw link to the TSV file in the datasets directory
df_link = 'https://raw.githubusercontent.com/raulpy271/languagesDataset/main/datasets/all_languages.tsv'
df = pd.read_csv(df_link, sep='\t')  # the file is tab-separated

The above code can be used in Jupyter, Google Colab, or any other environment, as long as you have pandas installed.

Another option is to clone this repository and import the datasets from the top-level package:

from datasets import languages

How to set up the script

If you want to run this module to build the languages dataset, you need to install the dependencies and set up some configuration.

To install the dependencies, clone the repo and type in your terminal:

pip install -r requirements.txt

After installing the dependencies, you should configure the following:

This module uses Selenium to communicate with a web browser and navigate through the sites, so you should install a web driver that lets Selenium talk to your browser; see this tutorial if you don't know how.

After downloading your driver, you should tell Selenium where the driver and browser binaries are located. To do this, change the get_driver function, which creates driver instances.
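
As a reference, here is a minimal sketch of what such a function could look like, assuming Selenium 4 with Firefox and geckodriver; the paths and browser choice are placeholders, and the actual get_driver in this repository may be structured differently:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service

def get_driver():
    # Placeholder paths: point these at the binaries on your machine
    options = Options()
    options.binary_location = '/usr/bin/firefox'  # browser binary
    service = Service(executable_path='/usr/local/bin/geckodriver')  # driver binary
    return webdriver.Firefox(service=service, options=options)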

After making the above configuration, you can run the module:

python main.py

With this command, the script will navigate through Wikipedia searching for information about all the languages. When the process finishes, the datasets will be saved to a path defined in the consts.py file, which you can change.
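
For example, the output location could be defined along these lines (a hypothetical excerpt; check consts.py for the actual constant names used in the repository):

# consts.py (hypothetical excerpt -- the real constant names may differ)
DATASETS_PATH = 'datasets/'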

In addition, if you only want to test the script and don't want to wait for the entire process, there is a way to search only the first few languages: define an environment variable called TESTING with a True value. To define this variable, use the .env file.
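
For example, a .env file that enables the test mode could contain a single line (assuming the script treats the literal string True as enabled):

TESTING=True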