safeway-data

Data mining tools for Safeway app.

Overview

A team of volunteers is building an app for refugees to find help in Europe and avoid human trafficking. It's a project that will have a big impact and save lives. 🇺🇦

The app development is done; now we need to load more data about helping points into the app. Volunteers have collected a lot of points in a spreadsheet, and we've written a tool in Python to convert it into a CSV format suitable for importing into the app database. But there are still enhancements needed in the data pipeline, as well as new web scraping spiders.

Installation

Make sure you have Python 3.9+ and Poetry installed:

pip install poetry

Then, clone and init the project:

git clone git@github.com:littlepea/safeway-data.git
cd safeway-data
poetry install

Preparation

In order to access Google Sheets, you'll need to prepare some secrets such as DEVELOPER_KEY.

You can place them in the config/.env file manually or ask @littlepea to provide you with a file containing the secrets.

If you want to fill out the secrets manually you can start from this template:

cp config/.env.example config/.env
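The file is a standard dotenv file. Its exact contents depend on .env.example, but it will contain at least the developer key mentioned above (the value below is a placeholder, not a real secret):

```shell
# config/.env -- placeholder value, replace with your own key
DEVELOPER_KEY=your-google-api-developer-key
```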

Usage

Check the --help output of the CLI:

❯ poetry run python main.py --help
Usage: main.py [OPTIONS]

Options:
  --dry-run / --no-dry-run        [default: no-dry-run]
  --install-completion [bash|zsh|fish|powershell|pwsh]
                                  Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                                  Show completion for the specified shell, to
                                  copy it or customize the installation.
  --help                          Show this message and exit.

Run the actual conversion script:

❯ poetry run python main.py          
Loading records from spreadsheet 1Y1QLbJ6gvPvz8UI-TTIUUWv5bDpSNeUVY3h-7OV6tj0
Saved 272 results into data/output.csv

If you have any questions, contact @littlepea

Data flow

There are two main ways this CLI tool gets used:

1) Converting data from a Google Sheet to CSV

❯ poetry run python main.py --spreadsheet-id 1Y1QLbJ6gvPvz8UI-TTIUUWv5bDpSNeUVY3h-7OV6tj0

This will run the convert_spreadsheet method with the following steps:

  • Fetch list of spreadsheet rows from a Google Sheet
  • Transform list of spreadsheet rows to list of Points of Interest
  • Optionally, sanitize addresses
  • Find missing coordinates by geocoding addresses
  • Translate city names to English
  • Validate the final list of points
  • Save points to a CSV file
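The pipeline above can be sketched roughly as follows. This is a simplified illustration, not the project's actual code: PointOfInterest, transform, geocode_missing, and validate are hypothetical stand-ins for the real classes and methods.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical, simplified model -- the real project defines its own schema.
@dataclass
class PointOfInterest:
    name: str
    address: str
    city: str
    lat: Optional[float] = None
    lng: Optional[float] = None

def transform(rows: list) -> list:
    """Turn raw spreadsheet rows (dicts) into PointOfInterest records."""
    return [
        PointOfInterest(name=r["name"], address=r["address"], city=r["city"])
        for r in rows
    ]

def geocode_missing(points: list, geocode: Callable[[str], Tuple[float, float]]) -> list:
    """Fill in coordinates for points that lack them, via a geocoding callable."""
    for p in points:
        if p.lat is None or p.lng is None:
            p.lat, p.lng = geocode(p.address)
    return points

def validate(points: list) -> list:
    """Keep only points that have both a name and an address."""
    return [p for p in points if p.name and p.address]
```

Each step takes and returns a plain list of points, so the stages can be composed, reordered, or skipped (e.g. address sanitization is optional) without changing the rest of the pipeline.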

2) Enhancing CSV data (scraped via spiders)

When we scrape points of interest using spiders (see below), we save the results in CSV (as points of interest) and then need to enhance them similarly to flow 1 above.

❯ poetry run python main.py --input-file data/france_red_cross.csv

This will run the convert_file method with the following steps:

  • Fetch list of points of interest from the input CSV file
  • Optionally, sanitize addresses
  • Find missing coordinates by geocoding addresses
  • Translate city names to English
  • Validate the final list of points
  • Save points to a CSV file

Running tests

❯ poetry run pytest
Test session starts (platform: darwin, Python 3.9.12, pytest 7.1.2, pytest-sugar 0.9.4)
collecting ... 
 tests/test_spreadsheet_adapter.py ✓                                                                                                                                                                      50% █████     
 tests/test_convert_data.py ✓                                                                                                                                                                            100% ██████████

Results (0.34s):
       2 passed

Running web scrapers

All the Scrapy spiders are in the scraping directory.

You can run a specific spider by supplying the name and output file:

poetry run scrapy crawl dopomoga -o data/dopomoga.csv

Creating new spiders

You can place your new spiders into the scraping/spiders directory and implement them according to the Scrapy tutorial.

It's highly recommended to add unit tests for your spider's parse method.
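One way to keep a spider's parse method easy to unit-test is to factor the extraction logic into a pure function that takes markup in and returns plain data out. The sketch below uses only the standard library's html.parser for illustration (a real spider would use Scrapy's selectors); LinkExtractor and extract_links are hypothetical helpers, not part of this project:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs from anchor tags in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def extract_links(html: str) -> list:
    """Pure function: HTML fragment in, list of (href, text) tuples out."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

Because extract_links takes a string and returns plain tuples, a unit test can feed it a saved HTML fixture and assert on the result without starting the Scrapy engine at all.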

Using VS Code

VS Code does not immediately recognize the virtual environment location. To make it work (so that imports are properly recognized):

  • Click Run => Add Configuration and select Python from the list. This will add a launch.json configuration.
  • Add one line to that configuration:

"env": {"PYTHONPATH": "${workspaceRoot}"}

It should look something like this:

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "env": {"PYTHONPATH": "${workspaceRoot}"},
            "console": "integratedTerminal",
            "justMyCode": true
        }
    ]
}
