Discussion #229

Closed
samayo opened this issue Jun 22, 2022 · 62 comments

Comments

@samayo
Owner

samayo commented Jun 22, 2022

Hello, this is to discuss a new major change to the repo.

I am trying to remove most countries not recognized by the UN.
Currently, there are 248 countries in this repo, but the UN recognizes only 193 of them, so this will be a big change.

Other than that, I will fill in all data for each country (so no null or empty values).

All data will also be automated (updated each week whenever something changes in a source such as Wikipedia).

Let me know if you would like this repo to keep only the UN-recognized countries.

@jezmck

jezmck commented Jul 22, 2022

I think that list is a valid requirement for some people, but not everyone.

I'd add that as a new list within this repo.

@iamdoubz
Contributor

iamdoubz commented Aug 8, 2022

Just add another file in src called "recognized-un-country.json" with a 1 or 0 value. This will keep the existing structure and push the responsibility to the person(s) creating their application. Hope this helps.

@samayo
Owner Author

samayo commented Aug 8, 2022

> Just add another file in src called "recognized-un-country.json" with a 1 or 0 value. This will keep the existing structure and push the responsibility to the person(s) creating their application. Hope this helps.

Thanks, but I don't think it would be good to keep the existing structure. Some src files have more entries than others. The idea is for all files to contain all 193 countries in the same order, so that getting several pieces of data for one country from all files is very convenient.
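
A minimal sketch of the convenience described above, assuming every file lists the same countries in the same order. The two in-memory lists stand in for src files; the field names `city` and `code` and the helper `merge_by_position` are hypothetical, not the repo's actual schema.

```python
# Two hypothetical src files, already loaded. Both list the same
# countries in the same order (the property proposed above).
capitals = [
    {"country": "France", "city": "Paris"},
    {"country": "Japan", "city": "Tokyo"},
]
codes = [
    {"country": "France", "code": "FR"},
    {"country": "Japan", "code": "JP"},
]


def merge_by_position(*files):
    """Combine the records found at the same index of every file."""
    if len({len(f) for f in files}) > 1:
        raise ValueError("files have different lengths")
    merged = []
    for records in zip(*files):
        # Guard against files that drifted out of order.
        names = {r["country"] for r in records}
        if len(names) != 1:
            raise ValueError(f"files are out of order: {sorted(names)}")
        combined = {}
        for r in records:
            combined.update(r)
        merged.append(combined)
    return merged


countries = merge_by_position(capitals, codes)
```

With identical ordering, no name matching is needed; the order check only catches accidental drift between files.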

@kennarddh

Is this data currently scraped from Wikipedia? If yes, is it automated?

@samayo
Owner Author

samayo commented Aug 4, 2023

Yes, it's mostly scraped from Wikipedia. Automating the process has been the goal for a long time, but I can't find much time, which is why it's not implemented.

@kennarddh

kennarddh commented Aug 4, 2023

I can implement the automation but I still don't understand the Wikipedia data.

https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population

On this Wikipedia page, what is the difference between a numbered country and a - country?

image

@samayo
Owner Author

samayo commented Aug 4, 2023

> Note: A numbered rank is assigned to the 193 member states of the United Nations, plus the two observer states to the United Nations General Assembly. Dependent territories and constituent countries that are parts of sovereign states are not assigned a numbered rank.

So numbered entries are officially recognised countries; unnumbered ones are disputed, like Taiwan for example.

This repo should focus only on recognised countries.
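
The rule in the quoted note can be sketched as a filter on the rank column. This assumes the rank cell holds either plain digits or a dash; the rows and the helper `is_ranked` are hypothetical, and real cells may carry footnote markers that need stripping first.

```python
def is_ranked(rank_cell: str) -> bool:
    """True when the rank column holds a number rather than a dash."""
    return rank_cell.strip().isdigit()


# Hypothetical rows, shaped like the population table's rank and name columns.
rows = [
    {"rank": "1", "name": "Country A"},
    {"rank": "–", "name": "Territory B"},
    {"rank": "2", "name": "Country C"},
]

# Keep only the entries that were given a numbered rank.
ranked = [row["name"] for row in rows if is_ranked(row["rank"])]
```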

@kennarddh

@samayo should this repo include the two observer states?

@samayo
Owner Author

samayo commented Aug 4, 2023

Yes I think that would be ok

@kennarddh

kennarddh commented Aug 4, 2023

Proposed Changes

Update

  • This will be a very big breaking change.
  • This will also increase the data size as more fields and data are added.
  • To reduce size we can remove null values: if a country doesn't have the data, we delete the entry instead of storing null. An API library can handle this by returning null instead of an error when the data is requested.
  • Data probably cannot be modified manually, as it will be overwritten every month.
  • Data will be scraped mostly from Wikipedia and Wikimedia. I still can't find some data.
  • Booleans will be used instead of the strings 1 or 0.
  • Maybe we can use an object like { 3LetterCountryCode: data } instead of [{ country: name, data: data }] to reduce size.
  • The periodic update will create a new pull request that still needs to be reviewed manually every month, in case Wikipedia changes its page HTML structure.
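
A rough sketch of three of the ideas in this list: keying by 3-letter code, omitting nulls, and replacing the strings 1 and 0 with booleans. The helper names `to_bool`, `to_keyed`, and `lookup` and the field name `value` are hypothetical; only the two data shapes come from the list above.

```python
def to_bool(flag):
    """Map the old "1"/"0" values to real booleans."""
    return {"1": True, "0": False}[str(flag)]


def to_keyed(records, codes, field):
    """Turn [{country, field}] into {code: value}, omitting null values."""
    keyed = {}
    for record in records:
        value = record.get(field)
        if value is not None:  # nulls are dropped to save space
            keyed[codes[record["country"]]] = value
    return keyed


def lookup(keyed, code):
    """Return the stored value, or None when the entry was omitted."""
    return keyed.get(code)


codes = {"France": "FRA", "Japan": "JPN"}
records = [
    {"country": "France", "value": 1},
    {"country": "Japan", "value": None},
]
keyed = to_keyed(records, codes, "value")
```

Here `lookup` plays the role of the API library mentioned above: a missing entry reads back as None rather than raising an error.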

Removed

  • Alphabet Letters: Do we need this data? It seems redundant.
  • Country name: Countries data can be used instead.

Added

  • Countries data
  • Currency symbol
  • National sport
  • Many new fields

Changed

  • Change Barcode prefix to GS1Code
  • Rename Country By Abbreviation to ISO3166. Add 2 and 3 letter code.
  • Rename Currency code to ISO4217.
  • Average Height
  • Rename Domain tld to ccTLD and add new fields.
  • Rename elevation to averageElevation

Source

Note

@samayo
Owner Author

samayo commented Aug 4, 2023

Great points, thanks for all the help so far. You are making this easy even if I want to implement it.

Some notes:

> Maybe we can use object like { 3LetterCountryCode: data } instead of [{ country: name, data: data }] to reduce size.

Let's leave the above as is for now, because I don't see a reason to change that.

I'm OK with removing alphabet letters but not country names. Remember that there are many websites and games that need to display just country names for some reason.

Other than those remarks everything else is a great idea

@samayo
Owner Author

samayo commented Aug 4, 2023

Btw, I was recently thinking of giving ChatGPT the Wikipedia section that contains the data and asking it to generate the Python code to convert the data from HTML to JSON, then using that script every month to look for more updates.

The script would be written in Python with Scrapy. I have an unfinished version of it on my local machine.

Once ChatGPT creates the script and it works, we upload the script to a server and run it with a cron job every month to scrape and send a PR.

That's what I thought initially. Feel free to build on the idea or provide your own.
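
For illustration, the HTML-to-JSON step could look roughly like the sketch below. It uses only the standard library's `html.parser` rather than Scrapy, and it is not the unfinished script mentioned above. The `TableParser` class, `table_to_records` helper, and sample table are hypothetical.

```python
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Use the first row as field names and the rest as records."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


sample = """
<table>
  <tr><th>country</th><th>value</th></tr>
  <tr><td>Country A</td><td>10</td></tr>
  <tr><td><a href="/wiki/B">Country B</a></td><td>20</td></tr>
</table>
"""
records = table_to_records(sample)
print(json.dumps(records, indent=2))
```

This sketch reads every row in the document, so a page with several tables would need the target table isolated first.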

@kennarddh

kennarddh commented Aug 5, 2023

I haven't used Python since 2020, so I can't implement it in Python.

Now I mostly use TypeScript with Node.js.

Instead of a VPS we can use GitHub Actions.

For country names, can we just use an array? Like [ "A", "B" ].

Can we move this repo to a new organization so I can add an SDK, if I can?

@kennarddh

Where did you find the flag svg @samayo?

@jezmck

jezmck commented Aug 7, 2023

Wherever they are, they 100% need to go through SVGO or a similar compressor.

@kennarddh

I can't find the SVG source to scrape. The image HTML is always unstructured.

@samayo
Owner Author

samayo commented Aug 7, 2023

It's from Wikipedia. Check each country's flag page, it will have SVG format

@kennarddh

@samayo
Owner Author

samayo commented Aug 7, 2023

I don't understand your question. You will find a .svg file on every Wikipedia page and that must be converted to base64 format. We store in this repo a base64 representation of the svg
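
A small sketch of the base64 step, assuming the SVG bytes have already been downloaded. The helper `svg_to_base64` and the tiny placeholder SVG are hypothetical. The thread does not say whether the repo stores bare base64 or a full data URI, so both forms are shown.

```python
import base64


def svg_to_base64(svg_bytes: bytes) -> str:
    """Encode raw SVG bytes as an ASCII base64 string."""
    return base64.b64encode(svg_bytes).decode("ascii")


# Placeholder SVG standing in for a downloaded flag file.
svg = (
    b'<svg xmlns="http://www.w3.org/2000/svg" width="3" height="2">'
    b'<rect width="3" height="2" fill="#fff"/></svg>'
)

encoded = svg_to_base64(svg)

# Assumption: a data URI is one possible storage form, not a confirmed one.
data_uri = "data:image/svg+xml;base64," + encoded
```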

@kennarddh

kennarddh commented Aug 7, 2023

Isn't SVGO for optimizing SVG?

@samayo
Owner Author

samayo commented Aug 7, 2023

You still need to right-click on the flag and select "open image in new tab...". Then you will see this URL:

https://upload.wikimedia.org/wikipedia/commons/a/a9/Flag_of_the_United_States_%28DoS_ECA_Color_Standard%29.svg

That is the SVG; the one you linked is an HTML page.

@kennarddh

What is svgo for?

@jezmck

jezmck commented Aug 7, 2023

That one is actually okay, but some flags are massive files.
Try https://jakearchibald.github.io/svgomg/ on the more complex flags.

SVGOMG is just a GUI for SVGO.

@kennarddh

Ok so wikipedia -> svgo optimize -> base64?

@kennarddh

kennarddh commented Aug 7, 2023

@samayo Can I make the scraper with TypeScript instead of Python?

@samayo
Owner Author

samayo commented Aug 7, 2023

I highly suggest Python so I can contribute too, but you decide. Where do you plan to host the script? Here or on your own GitHub page?

@kennarddh

kennarddh commented Aug 7, 2023

@samayo If you want to use Python, I can't contribute.

@kennarddh

I can create a new repo so I can use TypeScript instead, if you don't want it here, @samayo.

@samayo
Owner Author

samayo commented Aug 7, 2023

I think I will give it a shot, and you can also go ahead and try. We can use one or the other, or both. I am happy to get a regular PR from anyone.

@kennarddh

OK, I will create a PR later with TypeScript.

@kennarddh

@samayo any update for my previous question?

@samayo
Owner Author

samayo commented Aug 11, 2023

> @samayo Where to get Geo Coordinates?

All from Wikipedia.

@samayo
Owner Author

samayo commented Aug 11, 2023

> @samayo should the data include the country even though the data is null
>
> Like [{ country: 'x', data: null }] do we need to include this?

Yes, we should definitely add the country. We can use null, none, false or 0.
You can pick any format as long as it is consistent. I prefer null, since 0 could be confused with other data.
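
As a sketch of that rule, every country in the master list gets an entry, with None (serialised as JSON null) wherever the source had nothing. The helper `fill_missing`, the field name `value`, and the sample names are hypothetical.

```python
import json


def fill_missing(countries, records, field):
    """Return one record per country, using None where data is missing."""
    by_name = {r["country"]: r.get(field) for r in records}
    return [{"country": name, field: by_name.get(name)} for name in countries]


# Hypothetical master list and a scraped file that lacks one country.
countries = ["Country A", "Country B"]
scraped = [{"country": "Country B", "value": 5}]

filled = fill_missing(countries, scraped, "value")
as_json = json.dumps(filled)
```

The output keeps the master list's order, which matches the same-order requirement discussed earlier in the thread.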

@kennarddh

> > @samayo Where to get Geo Coordinates?
>
> all from wikipedia

Can you add the link or the Wikipedia page? I can't find it.

@samayo
Owner Author

samayo commented Aug 11, 2023

It seems I was wrong: it is not from Wikipedia, and the way the data is represented is not entirely optimal.

Can you use this instead? https://developers.google.com/public-data/docs/canonical/countries_csv

You can also use another source.
In any case, this data is unlikely to change, so you can even exclude it.

@kennarddh

> It seems I was wrong, it is not from wikipedia and the way the data is represented is not entirely optimal.
>
> Can you use this instead? https://developers.google.com/public-data/docs/canonical/countries_csv
>
> You can use other source. In any case, this data is unlikely to change so you can even exclude it

That data is different from this: https://github.com/samayo/country-json/blob/0c522ea1e7ae88e9a2dd979322fbf8c2814b0de6/src/country-by-geo-coordinates.json

The data from Google doesn't have west, south, north, east.

@samayo
Owner Author

samayo commented Aug 11, 2023

It's fine, we can use whatever is close enough and if there is a need to improve it then we can do that later

@kennarddh

kennarddh commented Aug 14, 2023

@samayo I have a problem: the same country has different names on different Wikipedia pages.

For example

We can compare the URLs, but I don't know whether they will still differ. I'll try.

Edit 1

  • Cannot use the URL as an id because some Wikipedia pages use redirects.

Edit 2

@samayo
Owner Author

samayo commented Aug 14, 2023

I don't know about the redirect issue, but if you found the solution then it's good.
About country names being different on different pages, I think we have to use custom code logic for that.
e.g., if (countryName === "Netherlands, Kingdom of the") { countryName = "Netherlands"; }

@kennarddh

> I don't know about the redirect issue, but if you found the solution then it's good. About country names being different on different pages, I think we have to use custom code logic for that. e.g., if (countryName === "Netherlands, Kingdom of the") { countryName = "Netherlands"; }

If we use ifs like that, the automation will break whenever the Wikipedia page is updated, and there are so many aliases.
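
One possible middle ground, sketched here as an assumption rather than anything agreed in the thread, is to keep the special cases in a lookup table instead of a chain of ifs. New aliases then become data changes, not code changes. The `ALIASES` table and `canonical_name` helper are hypothetical, and the second alias is illustrative.

```python
# Hypothetical alias table: casefolded variant -> canonical name.
ALIASES = {
    "netherlands, kingdom of the": "Netherlands",
    "kingdom of the netherlands": "Netherlands",  # illustrative extra alias
}


def canonical_name(raw: str) -> str:
    """Collapse whitespace, then map known aliases to one canonical name."""
    cleaned = " ".join(raw.split())
    return ALIASES.get(cleaned.casefold(), cleaned)


name = canonical_name("Netherlands, Kingdom of the")
```

Unknown names pass through unchanged, so a renamed page surfaces as an unexpected name in review instead of silently failing.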

@samayo
Owner Author

samayo commented Aug 14, 2023

That's unlikely. I don't know what you are planning to use, but with Python and pandas you can find the table you are looking for very easily.

Take a look at this: https://medium.com/analytics-vidhya/web-scraping-a-wikipedia-table-into-a-dataframe-c52617e1f451

From step 5, it is very easy to get all tables in the page and target the table you need.

So it's unlikely any changes will break it, as far as I can tell.

@kennarddh

I can resolve the redirect issue using this API: https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bredirects
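
A hedged sketch of how that API might be used: build a `action=query` URL with the `redirects` parameter, then read the `query.redirects` list from the JSON response. No request is made here. The helper names are hypothetical, and the sample response is hand-written with placeholder titles.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def redirect_query_url(titles):
    """Build a query URL that asks the API to resolve redirects."""
    params = {
        "action": "query",
        "format": "json",
        "redirects": 1,
        "titles": "|".join(titles),  # several titles are pipe-separated
    }
    return f"{API}?{urlencode(params)}"


def redirect_map(response):
    """Map each redirected title to its target, from a parsed response."""
    redirects = response.get("query", {}).get("redirects", [])
    return {item["from"]: item["to"] for item in redirects}


url = redirect_query_url(["Old title", "Other page"])

# Hand-written sample in the response shape; the titles are placeholders.
sample_response = {
    "query": {"redirects": [{"from": "Old title", "to": "New title"}]}
}
resolved = redirect_map(sample_response)
```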

@kennarddh

@samayo On this Wikipedia page: https://en.wikipedia.org/wiki/List_of_country_calling_codes

Some countries have 2 or more codes

image

Which code should we include?

The old JSON just concatenates the 1 with 939, ignoring 787:

{
  "country": "Puerto Rico",
  "calling_code": 1939
},

@samayo
Owner Author

samayo commented Aug 22, 2023

@kennarddh
We have to use both separated by a comma, if you have better ideas let me know
Thanks

@kennarddh

> @kennarddh We have to use both separated by a comma, if you have better ideas let me know Thanks

We can use something like this

{
  "country": "example",
  "data": [1787, 1939]
}
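
A minimal sketch of building that array form from a shared prefix and its area codes, using the Puerto Rico numbers quoted earlier in the thread (1 with 787 and 939). The helper `full_codes` is hypothetical.

```python
def full_codes(prefix, area_codes):
    """Join a shared prefix with each area code, e.g. 1 + 787 -> 1787."""
    return [int(f"{prefix}{area}") for area in area_codes]


entry = {"country": "Puerto Rico", "data": full_codes(1, [787, 939])}
```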

@samayo
Owner Author

samayo commented Aug 22, 2023

@kennarddh looks good to me

@kennarddh

@samayo Russia has something like a range of codes?

image

{
  "country": "russia",
  "data": [71, 72, 73, 74, 75, 78, 79]
}

Is this right?

@samayo
Owner Author

samayo commented Aug 23, 2023

@kennarddh

I think it's better to use 7 (007) as the others seem to be extensions

https://countrycode.org/russia

@kennarddh

kennarddh commented Aug 23, 2023

> @kennarddh
>
> I think it's better to use 7 (007) as the others seem to be extensions
>
> https://countrycode.org/russia

image

Kazakhstan also uses 7

image

And also this

@samayo
Owner Author

samayo commented Aug 23, 2023

Yes the two first countries are not officially recognised.

Let's use 7 for Russia until something changes

@kennarddh

kennarddh commented Aug 23, 2023

> Yes the two first countries are not officially recognised.
>
> Let's use 7 for Russia until something changes

image

And for this?

@samayo
Owner Author

samayo commented Aug 23, 2023

1242

The Bahamas is an independent country but uses the US calling code +1 and extension 242

So we should use the full calling code 242

@kennarddh

kennarddh commented Aug 23, 2023

> 1242
>
> The Bahamas is an independent country but uses the US calling code +1 and extension 242
>
> So we should use the full calling code 242

It's not consistent.

@samayo how do we parse it if it's not consistent?

@samayo
Owner Author

samayo commented Aug 23, 2023

In that case let's use the value as seen in Wikipedia, just the way it is written

1 (variation 1, variation 2)

This is harder than I thought but this is the right thing to do imo

@samayo
Owner Author

samayo commented Aug 23, 2023

> In that case let's use the value as seen in Wikipedia, just the way it is written
>
> 1 (variation 1, variation 2)
>
> This is harder than I thought but this is the right thing to do imo

Another question is whether we should include 00 or + or nothing as the prefix. I say we include 00, as in 001.

So a JSON entry for the US should be 001
For the Bahamas 001 (x, y,)
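
To make the "code (variation, variation)" shape concrete, here is a rough parsing sketch. The regular expression and the helper `parse_calling_code` are assumptions about how such a cell could be read, not a description of the actual Wikipedia markup. A trailing comma, as in the Bahamas example above, is tolerated.

```python
import re

# Optional "+", the country code digits, then an optional "(...)" group.
PATTERN = re.compile(r"^\s*\+?(\d+)\s*(?:\(([^)]*)\))?\s*$")


def parse_calling_code(text):
    """Split "1 (787, 939)" into ("1", ["787", "939"])."""
    match = PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognised calling code: {text!r}")
    code, inner = match.group(1), match.group(2)
    if not inner:
        return code, []
    # Empty pieces (e.g. from a trailing comma) are skipped.
    variations = [v.strip() for v in inner.split(",") if v.strip()]
    return code, variations


parsed = parse_calling_code("1 (787, 939)")
```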

@kennarddh

So use a string instead of a number, @samayo?

And should we pad the x and y with 00?

And for Russia, should we leave it like x (y-z, c, d)?

@samayo
Owner Author

samayo commented Aug 23, 2023

Good question. Yes, we should use a string, but we should not pad x, y with 00, because then it would mean 00100424

@kennarddh

kennarddh commented Aug 23, 2023

> Good question, yes we should use string but we should not pad x, y with 00 because then it would mean 00100424

Shouldn't it be 00x if x is 1 char,

0x if x is 2 chars,

and x if it's more than 2 chars?

And Russia will be 007 (1–5, 8, 9)
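
The padding rule proposed in this last comment (left-pad the country code with zeros to three characters and leave the variations untouched) could be sketched as below. The thread ends without settling this format, so it illustrates the proposal only. The helper `format_calling_code` is hypothetical.

```python
def format_calling_code(code, variations=()):
    """Pad the code to three characters, then append any variations."""
    padded = code.zfill(3)  # "7" -> "007", "44" -> "044", "1242" stays as is
    if not variations:
        return padded
    return f"{padded} ({', '.join(variations)})"


russia = format_calling_code("7", ["1–5", "8", "9"])
```

Keeping the result as a string preserves the leading zeros, which a JSON number would drop.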

@samayo
Owner Author

samayo commented May 5, 2024

closing as no update needed

@samayo samayo closed this as completed May 5, 2024

4 participants