Datasets with programming languages info

The goal of this repository is to mine information and build datasets about programming languages.

The dataset currently contains more than 600 languages, and it includes each language's website, creation date, paradigms, and type system.

In addition, my goal is to include information about the popularity trends of each language, so feel free to send suggestions on how to do it, or implement it yourself and send a pull request.

Using the dataset

The following code queries the newest programming languages (it assumes you are running from a clone of this repository; see the next section for loading the dataset remotely):

>>> from datasets import languages
>>> languages.sort_values('first_release', ascending=False, inplace=True)
>>> languages[['name', 'first_release']].head()

               name  first_release
494  project verona           2019
65           bosque           2019
582          source           2017
507              q#           2017
51        ballerina           2017

If you want to see more usage examples, see this notebook on GitHub, or open it in Google Colab.

How to use the dataset

The dataset is stored as a tab-separated file inside the datasets directory, so you only need to paste the raw link of the file:

import pandas as pd

# Raw link to the TSV file in the datasets directory
df_link = 'https://raw.githubusercontent.com/raulpy271/languagesDataset/main/datasets/all_languages.tsv'
df = pd.read_csv(df_link, sep='\t')  # the file is tab-separated

The above code can be used in Jupyter, Google Colab, or any other environment, as long as you have pandas installed.

Another option is to clone this repository and import the datasets from the top-level package:

from datasets import languages

How to set up the script

If you want to run this module to build the languages dataset, you need to install the dependencies and set up some configuration.

To install the dependencies, clone the repo and type in your terminal:

pip install -r requirements.txt

After installing the dependencies, you should configure the following:

This module uses Selenium to communicate with a web browser and navigate through the sites, so you should install a web driver that lets Selenium talk to your browser; see this tutorial if you don't know how.

After downloading your driver, you should tell Selenium where the driver and browser binaries are located. To do this, change the get_driver function, which creates driver instances.
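
As a reference, here is a minimal sketch of what such a function could look like, assuming Selenium 4 with Firefox and geckodriver; the paths and browser choice are placeholders, and the actual get_driver in this repository may be structured differently:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service

def get_driver():
    # Placeholder paths: point these at the binaries on your machine
    options = Options()
    options.binary_location = '/usr/bin/firefox'  # browser binary
    service = Service(executable_path='/usr/local/bin/geckodriver')  # driver binary
    return webdriver.Firefox(service=service, options=options)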

After making the above configuration, you can run the module:

python main.py

With this command, the script will navigate through Wikipedia searching for information about all the languages. When the process finishes, the datasets will be saved to a path defined in the consts.py file, which you can change.
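
For example, the output location could be defined along these lines (a hypothetical excerpt; check consts.py for the actual constant names used in the repository):

# consts.py (hypothetical excerpt -- the real constant names may differ)
DATASETS_PATH = 'datasets/'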

In addition, if you only want to test the script and don't want to wait for the entire process, there is a way to search only the first few languages: define an environment variable called TESTING with a True value. To define this variable, use the .env file.
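
For example, a .env file that enables the test mode could contain a single line (assuming the script treats the literal string True as enabled):

TESTING=True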