Geolocate user IP addresses when presenting them in UI #8158

Open
7 of 10 tasks
di opened this issue Jun 24, 2020 · 20 comments
Labels
feature request meta Meta issues (rollouts, etc)

Comments

@di
Member

di commented Jun 24, 2020

What's the problem this feature will solve?
Currently in the PyPI logged-in UI, we show the IP address that performed certain actions to the user:

[Screenshot: Screen Shot 2020-06-24 at 6 37 55 PM]

I don't know my own IP offhand. Especially if there are multiple different IPs listed here, I would need to manually look up the approximate location where these came from to get an idea of whether they were actually me or not.

Describe the solution you'd like
It would be nice if PyPI also showed me an (approximate) location for any given IP address as well, so I could easily visually filter ones that seem incorrect, e.g.:

| Event | Date / time | IP address |
| --- | --- | --- |
| Logged in | less than 10 seconds ago | 11.22.11.22 (Austin TX USA) |
| Logged in | June 22, 2020 | 22.33.22.33 (Austin TX USA) |
| Logged in | June 19, 2020 | 44.55.44.55 (Timbuktu, Mali) |
| Logged in | June 19, 2020 | 66.77.66.77 (Austin TX, USA) |

Additional context
This shouldn't require external API calls. Using something like https://pypi.org/project/geoip2/ with an embedded database like https://dev.maxmind.com/geoip/geoip2/geolite2/ would probably work.

Ideally this would be determined on the fly and not stored anywhere (e.g. along with the IP address), so if we someday replaced the mechanism with something more precise (or just updated the embedded DB) the updates would be immediately reflected.
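
A minimal sketch of the on-the-fly lookup described above, assuming the `geoip2` package and a locally available GeoLite2 City `.mmdb` file. The helper names `format_location` and `describe_ip` are hypothetical and not part of Warehouse; the location is computed at display time and never stored.

```python
try:
    from geoip2.errors import AddressNotFoundError
except ImportError:
    # geoip2 is not installed; define a stand-in so the sketch
    # still runs with a stub reader.
    class AddressNotFoundError(Exception):
        pass


def format_location(city, region, country):
    """Join whichever location parts are known, e.g. 'Austin, TX, US'."""
    parts = [part for part in (city, region, country) if part]
    return ", ".join(parts) if parts else "Unknown location"


def describe_ip(reader, ip_address):
    """Return 'IP (location)' for display; hypothetical helper.

    `reader` is expected to behave like geoip2.database.Reader,
    i.e. expose a city(ip) method.
    """
    try:
        response = reader.city(ip_address)
    except AddressNotFoundError:
        return f"{ip_address} (Unknown location)"
    location = format_location(
        response.city.name,
        response.subdivisions.most_specific.iso_code,
        response.country.iso_code,
    )
    return f"{ip_address} ({location})"
```

With a real database, `reader` would be something like `geoip2.database.Reader("GeoLite2-City.mmdb")` (the path is an assumption). Because nothing is persisted, swapping in an updated database changes the displayed locations immediately.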

Todo list

  • Use Fastly's geolocation services to determine geographic location at edge
  • Hash & salt IP addresses at edge, pass those to our backends/logs (populate an X-PyPI-Hashed-IP header)
  • Replace IP addresses in gunicorn logs
  • Begin storing hashed IPs everywhere for all events
  • Replace IP addresses in the user-facing UI (user/project events) with corresponding geolocation data
  • Replace IP addresses in journals with corresponding hashed IP
  • Drop submitted_from column from journals table (duplicated in ProjectEvent)
  • For all events tables, change IP columns to CIText, retroactively hash IP addresses (geoIP data will be missing)
  • For Admin IP addresses table, data migration to retroactively hash the IP addresses.
  • Drop X-Fastly-IP header at edge
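
The hash-and-salt step in the list above could look roughly like the following. This is only an illustration in Python: the thread does not show the actual edge implementation, so the choice of SHA-256, the way the salt is combined with the address, and the function names `hash_ip` and `backend_headers` are all assumptions.

```python
import hashlib


def hash_ip(ip_address: str, salt: str) -> str:
    """Return a hex digest that stands in for the raw IP address.

    Assumption: SHA-256 over the address concatenated with a secret salt.
    """
    return hashlib.sha256(f"{ip_address}{salt}".encode("utf-8")).hexdigest()


def backend_headers(client_ip: str, salt: str) -> dict:
    """Build the header passed to the backend; the raw IP is not included."""
    return {"X-PyPI-Hashed-IP": hash_ip(client_ip, salt)}
```

The same address and salt always produce the same digest, so events can still be correlated by hashed IP without the backend or logs ever seeing the address itself.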
@sanjaysiddhanti
Contributor

@di I can help with this. A couple questions:

  • I downloaded the Geolite2 city DB and the file size is 60M. I don't think it's a good idea to check this into the git repo. Does Warehouse use S3 or something similar where we could store this?

  • Related to the above, how do you recommend making the DB available to the application? We could pull it from S3 during the Docker build, but I wonder if there's a better option like mounting it as a volume when the container starts. I'm not familiar with how Warehouse is deployed (though I did read https://warehouse.readthedocs.io/application/), so let me know how you typically handle this.

@di
Member Author

di commented Jun 25, 2020

I downloaded the Geolite2 city DB and the file size is 60M. I don't think it's a good idea to check this into the git repo.

I'm surprised it's that large -- we might want to see if there are more lightweight options, or whether we can slim it down at all (IIUC it probably contains a lot of data we don't need). That said, we check in the development database which clocks in at more than 60MB, so this might be OK for a one-time thing.

Does Warehouse use S3 or something similar where we could store this?

We use a datastore to store PyPI's files, but it wouldn't really be appropriate to put this there. In the best possible case, this would be a package on PyPI we could just add as a dependency, but I couldn't find anything that included its own database, just libraries that talked to external APIs.

Related to the above, how do you recommend making the DB available to the application?

I think the easiest thing to do would be to add it into the repo and pull it in from there. Given the size though, I'm a little hesitant to say that's the best option.

@sanjaysiddhanti
Contributor

I'm looking into lighter-weight options, taking inspiration from other libraries. There are also some recent license changes to the Geolite2 DB that we'll need to review, but I'll first look at other options.

@sanjaysiddhanti
Contributor

sanjaysiddhanti commented Jun 26, 2020

I looked at db-ip's City DB. It has a more permissive license, but it's even bigger - 85M.

Both providers offer a CSV format but in both cases, CSV is bigger than the corresponding MMDB file.

I don't know how we would slim down the MMDB file, and curating the CSV file seems like a lot of work, especially since they release regular updates and we may not want to be locked in to the version that we curate.

So it sounds like we can check one of the MMDB files into the git repo, or make an external API call - what was the reason that you didn't like the API call?

@di
Member Author

di commented Jun 26, 2020

what was the reason that you didn't like the API call?

Potential added expense / external dependency, probably not worth it for this very small feature. Unless we could do this entirely on the frontend, in JS, for free... is that an option?

Agreed we probably don't want to curate CSV files.

@sanjaysiddhanti
Contributor

Hmm. We could do it in JS if the user grants access to their location, but then we'd need to store that in a DB to look back at it for future logins. To get the location just from IP, I think we'd still need an API call from the JS code.

There are also country DBs that are much smaller (the Geolite country DB is less than 4M). But I don't think it helps us much to display the country of the user?

@di
Member Author

di commented Jul 2, 2020

Ah, I meant call some API from JS, not correlate the user's location from their browser w/ their IP.

Another consideration for not using an API is maintaining privacy, i.e. keeping all the IPs w/in Warehouse.

I think just displaying the country is probably too vague to be useful.

@SanketDG

SanketDG commented Jul 2, 2020

Ah, I meant call some API from JS

If you are talking about a REST API (and not a library), wouldn't that also route all the IPs to an external location?

@pradyunsg
Contributor

pradyunsg commented Jul 2, 2020

@di Is there some reason (perhaps legal) that we can't have the Geolite2 or db-ips actual databases in a Python Package that we make a dependency of warehouse, and not add them into this repository directly?

If we can do that, I feel like we should since we could have it be updated at some appropriate cadence and, more importantly, avoid making the git repository for this project bigger.

@di
Member Author

di commented Jul 2, 2020

From #8158 (comment):

In the best possible case, this would be a package on PyPI we could just add as a dependency, but I couldn't find anything that included its own database, just libraries that talked to external APIs.

@di
Member Author

di commented Jul 2, 2020

And yes, I'm assuming we wouldn't be allowed to redistribute it.

@patelneel55
Contributor

Assuming that Warehouse can't store and redistribute the DB, there is a public BigQuery table under fh-bigquery.geocode.201806_geolite2_city_ipv4_locs which contains the data from GeoLite2. I'm not sure if this counts, but technically it's not an API call from JS, and it ensures that the calls are made from the backend during user events so the addresses stay within Warehouse. https://cloud.google.com/blog/products/data-analytics/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds#jump-content:~:text=Geolocating%20one%20IP%20address%20out%20of%20millions
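
Whatever backs it (a BigQuery table, a CSV, or an MMDB file), a lookup like this amounts to finding the network that contains a given address. A rough sketch of that idea using only the standard-library `ipaddress` module follows; the rows are made-up sample data based on the example in this issue, not real GeoLite2 records, and `locate` is a hypothetical name.

```python
import ipaddress

# Made-up sample rows (network, location); not real GeoLite2 data.
SAMPLE_ROWS = [
    ("11.22.0.0/16", "Austin TX USA"),
    ("44.55.44.0/24", "Timbuktu, Mali"),
]


def locate(ip_address, rows=SAMPLE_ROWS):
    """Return the location of the most specific network containing the IP,
    or None if no row matches."""
    ip = ipaddress.ip_address(ip_address)
    best = None  # (network, location) of the longest matching prefix so far
    for cidr, location in rows:
        network = ipaddress.ip_network(cidr)
        if ip in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, location)
    return best[1] if best else None
```

This linear scan is only meant to show the matching logic; a real city-level dataset has far too many rows for it, which is why indexed formats or a query engine are used instead.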

@pradyunsg
Contributor

pradyunsg commented Jul 9, 2020

And yes, I'm assuming we wouldn't be allowed to redistribute it.

https://db-ip.com/db/download/ip-to-country-lite is under https://creativecommons.org/licenses/by/4.0/, which does allow redistribution.

That's not the case for GeoLite2's DB though -- they changed licensing last year for California Consumer Privacy Act (CCPA) compliance: https://blog.maxmind.com/2019/12/18/significant-changes-to-accessing-and-using-geolite2-databases/

@di
Member Author

di commented Mar 31, 2023

Some ideas here on implementing this with more privacy-protecting features around IP addresses as well:

  • Use Fastly's geolocation services to determine geographic location at edge
  • Hash & salt IP addresses at edge, pass those to our backends/logs (drop X-Fastly-IP header, populate an X-PyPI-Hashed-IP header)
  • Replace IP addresses in the user-facing UI (user/project events) with corresponding geolocation data
  • Replace IP addresses in journals with corresponding hashed IP
  • For all events tables, change IP columns to CIText, retroactively hash IP addresses (geoIP data will be missing)
  • For Admin IP addresses table, data migration to retroactively hash the IP addresses.

@di
Member Author

di commented Apr 7, 2023

GeoIP and salting at edge are done in pypi/infra#123

@di
Member Author

di commented Apr 7, 2023

Logging salted IPs is done in #13389

@ewdurbin
Member

We now display GeoIP information if available: #13745

@ewdurbin ewdurbin reopened this May 25, 2023
@ewdurbin
Member

Ope, missed that this was a meta issue.

@ewdurbin
Member

Begin storing hashed IPs everywhere for all events: #13716, #13744

Replace IP addresses in the user-facing UI (user/project events) with corresponding geolocation data #13745

@di di added the meta Meta issues (rollouts, etc) label May 25, 2023
@ewdurbin
Member

submitted_from column dropped from journals table in #13751 and #13752


6 participants