27 create db tables aka django modelspy #50

Merged
merged 16 commits into from
May 3, 2024


JPMartinezClaeys
Contributor

Here is a pass at the Django models to start ingesting data.
This is based on the diagram and md file that describes the data model for milestone 4.

The only difference is that, since the subway information is at the station level and the bus data is at the route level, I changed the first field from a ForeignKey to the TransitStation class into a CharField that we can still match against the TransitStation table.

If something in the model makes you think it will make ingestion too complicated, let me know so we can discuss how to adapt the models.

Contributor

@jimenasalinas left a comment

Looks good! Matches our data model!

n_started = models.IntegerField()
n_ended = models.IntegerField()


Contributor

Thank you JP! This looks great, the models are consistent with what we were expecting based on the documentation.

class TransitStation(models.Model):
    station_id = models.CharField(max_length=30, primary_key=True)
    location = models.PointField()
    route = models.CharField(max_length=30)
Contributor

Is this a route name (like "Red Line") or a route_id like "Chicago_CTA_R" or whatever it ends up being? Given that it's set up this way, I take it we went with separate entities for each line at a multi-line stop.

Contributor

@mbjackson-capp left a comment

I don't understand Django well enough yet to do a detailed review of the syntax (can do a deeper dive with you or by myself this week if requested); approving this since nothing looks blatantly contrary to the data model.

Once we have proper route path ingestion (i.e. once #43 pull request is in) we may want to create a new issue for adding existing route paths to the db (unless I'm mistaken/missing it here or it's somehow "too late"); class ExistingRoute(models.Model) or similar; my guess is they'd need a MultiLineStringField() for route (to handle lines with branches, like the CTA Green line) and some other attributes like a string representing hex for route color.

@meganhmoore
Contributor

Sorry for so many nitpick comments. I think the main thing that I was not detailed in checking on the tables is that each table needs a primary key (usually an id field) and the columns that reference other tables need to be the same type, ideally same name and be set as foreign keys. Otherwise all of the types look good as far as my understanding of max char lengths and things. Also Matt makes a good point about maybe adding a color field for ease of visualization but I'm not sure we need that immediately, I think it would be easy to tack on later.

Happy to sit tomorrow evening/wednesday and go through it if you have questions.

@meganhmoore
Contributor

meganhmoore commented Apr 29, 2024

> I don't understand Django well enough yet to do a detailed review of the syntax (can do a deeper dive with you or by myself this week if requested); approving this since nothing looks blatantly contrary to the data model.
>
> Once we have proper route path ingestion (i.e. once #43 pull request is in) we may want to create a new issue for adding existing route paths to the db (unless I'm mistaken/missing it here or it's somehow "too late"); class ExistingRoute(models.Model) or similar; my guess is they'd need a MultiLineStringField() for route (to handle lines with branches, like the CTA Green line) and some other attributes like a string representing hex for route color.

I am not sure if this completely answers your question, and @JPMartinezClaeys correct me if I am misstating, but the TransitStation is composed of a station and a route which means that the green line will have a Roosevelt (or the id for Roosevelt) entry and the red line will have a Roosevelt (or the id for Roosevelt) entry, basically treating them as separate places. But the ridership will be at the level of all routes that belong to Roosevelt (or the id for Roosevelt) if that makes sense?

@JPMartinezClaeys
Contributor Author

JPMartinezClaeys commented Apr 30, 2024

> Sorry for so many nitpick comments. I think the main thing that I was not detailed in checking on the tables is that each table needs a primary key (usually an id field) and the columns that reference other tables need to be the same type, ideally same name and be set as foreign keys. Otherwise all of the types look good as far as my understanding of max char lengths and things. Also Matt makes a good point about maybe adding a color field for ease of visualization but I'm not sure we need that immediately, I think it would be easy to tack on later.
>
> Happy to sit tomorrow evening/wednesday and go through it if you have questions.

When a primary key is not specified, Django creates one for the table automatically, which is basically an integer from 1 to N.
Regarding the references to other tables, the tricky part is that TransitStation stores information at the bus stop/subway station level. Ridership matches that level for subways but not for buses, where the information is at the route level. It might get a bit messy, but I was thinking of merging the ridership with the station table on different columns depending on whether it's bus or subway information (we can add a column that unifies this to make the merging less painful).

@JPMartinezClaeys
Contributor Author

@jamesturk @divij-sinha

Hi, we wanted to add you as reviewers for this PR given that defining the models correctly will be key for ingestion.

We have the following tables in our model:

  • Demographic (Demographics for a census tract pulled from the ACS)
  • TransitStation (Representation of bus stops and subway stations)
  • TransitRoute (Representation of bus routes and subway lines)
  • TransitRidership (Ridership for bus routes and subway stations)
  • BikeStation (Representation of bike sharing docking stations)
  • BikeRidership (Ridership for a bike docking station)
  • Survey (Survey deployed in the platform)
  • SurveyAnswer (Answers to surveys)
  • PlannedRoute (Answers to "Plan your route feature")

We have two main questions on which we would like more detailed feedback:

  1. There is a mismatch between our TransitStation table and the TransitRidership table, specifically for buses, where ridership is reported at the route level and not at the bus stop level.
    The main question here is how we should handle references between the TransitStation and TransitRidership tables. Given the mismatch, I decided against having TransitRidership refer to TransitStation with a ForeignKey (which would enforce that every observation has a TransitStation instance as a column). Instead, I created a column that references the transit unit (bus route or subway station), which could later be matched to an observation in the TransitStation or TransitRoute table.
    Mainly, we would like to know whether we should modify our models so that TransitRidership has a ForeignKey to reference other tables (or add another model that groups ridership units, containing just bus routes and subway stations).
  2. We were wondering about the best way to store surveys that could change over time. Initially we were thinking of creating a class for "survey answers" where we would just store each answer to the survey, with the columns mostly being the questions the answers refer to. Since this approach leaves the survey questions fixed (and is therefore slightly unrealistic), we moved to creating a Survey and a SurveyAnswer model so that we could modify the deployed survey if necessary.
    We are now storing each survey as JSON, which lets us flexibly change the number of questions a survey has. We were wondering if this works, and whether there are any best practices for continuously storing form information that our implementation is missing.

The second question is less time-sensitive, given that it involves no ingestion that is strictly necessary to move forward with the project.

Thanks!

@jamesturk
Member

For #2, this is a really good use of a JSONField generally speaking, I'd probably do something similar. A few notes on downsides to keep in mind:

  • If you need to do lookups on common fields between surveys, that is hard without a data model change.
  • You no longer get strong enforcement of the relationship between survey & answer, so it'd be possible for them to diverge. I probably wouldn't let a survey with answers be edited, for instance, since that'd create a mismatch between answers/questions after the fact.
  • They are a little trickier to work with, requiring custom operators since you can't always use the standard SQL ones. Django mostly smooths over this difference if you're accessing the field through it, but if you use any raw SQL, you'll need to learn the specific JSONField operators.
  • Not a concern here, but not all DBs support JSONField, so using them limits portability.

I'm taking a closer look at #1 now.
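
The second bullet above (not letting a survey with answers be edited) could be enforced with a small guard. This is a minimal sketch that uses the standard-library `sqlite3` module as a stand-in for the project's Django models; the table names, column names, and the `update_questions` helper are assumptions made for illustration and are not taken from the PR.

```python
import json
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical survey tables."""
    conn = sqlite3.connect(":memory:")
    # Questions are stored as a JSON document, similar in spirit to a JSONField.
    conn.execute(
        "CREATE TABLE survey (id INTEGER PRIMARY KEY, questions TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE survey_answer ("
        " id INTEGER PRIMARY KEY,"
        " survey_id INTEGER NOT NULL REFERENCES survey(id),"
        " answers TEXT NOT NULL)"
    )
    return conn


def update_questions(conn: sqlite3.Connection, survey_id: int, questions: list) -> None:
    """Replace a survey's questions, but only while it has no answers."""
    (n_answers,) = conn.execute(
        "SELECT COUNT(*) FROM survey_answer WHERE survey_id = ?", (survey_id,)
    ).fetchone()
    if n_answers:
        # Editing now would leave stored answers pointing at changed questions.
        raise ValueError(f"survey {survey_id} already has {n_answers} answer(s)")
    conn.execute(
        "UPDATE survey SET questions = ? WHERE id = ?",
        (json.dumps(questions), survey_id),
    )
```

In a Django project the same check would more likely live in the model's `save()` or `clean()` method, with a new Survey row created whenever the questions need to change.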


#################################
######## SURVEY MODELS ##########
#################################
Member

small note, these would typically be in their own app when they are this distinct. not necessary for the project, but wanted to mention as a "best practice"

@jamesturk
Member

We can discuss a bit today, but I think it might make sense to just go with two different tables that each have a FK to their respective unit. I'm not sure what cross-unit analysis you plan to/need to do, so that might not be the right call, but it'd be preferable to have the model tied directly to the table in question, and aside from something called Generic Foreign Keys which are tough to work with and a bit of a hack, that'd require two different (though nearly identical) tables.
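
The "two nearly identical tables, each with a FK to its own unit" idea can be sketched at the database level. This is an illustration using the standard-library `sqlite3` module rather than Django, and every table and column name here is an assumption chosen for the example. It shows what the foreign keys buy: a ridership row cannot point at a station or route that does not exist.

```python
import sqlite3

# Hypothetical schema: one ridership table per unit type, each with its own FK.
SCHEMA = """
CREATE TABLE transit_station (station_id TEXT PRIMARY KEY);
CREATE TABLE transit_route (route_id TEXT PRIMARY KEY);
CREATE TABLE ridership_station (
    id INTEGER PRIMARY KEY,
    station_id TEXT NOT NULL REFERENCES transit_station(station_id),
    date TEXT NOT NULL,
    ridership INTEGER NOT NULL
);
CREATE TABLE ridership_route (
    id INTEGER PRIMARY KEY,
    route_id TEXT NOT NULL REFERENCES transit_route(route_id),
    date TEXT NOT NULL,
    ridership INTEGER NOT NULL
);
"""


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with foreign key checks enabled."""
    conn = sqlite3.connect(":memory:")
    # SQLite leaves foreign key enforcement off unless it is requested.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

Inserting into `ridership_route` with a `route_id` that is missing from `transit_route` raises `sqlite3.IntegrityError`, which is the kind of guarantee a plain CharField reference does not provide.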



class Demographics(models.Model):
census_tract = models.CharField(
Contributor

I remember @benjaleivas was ingesting at the block level, so this works as long as he is good with aggregating to the tract level by the time it goes to the db and that won't lose info the frontend needs (I am pretty sure this is ok but want to double check).

Contributor Author

Modified to the block level

population = models.IntegerField()
# age = models.IntegerField()
median_income = models.IntegerField()
transportation_to_work = models.CharField(
Contributor

max chars?

Contributor Author

Fixing it!

verbose_name="Means of Transportation to Work"
)
work_commute_time = models.FloatField(verbose_name="Time of commute to work")
vehicles_available = models.IntegerField()
Contributor

would this be a float median/mean estimate?

Contributor Author

I think this field is a count of vehicles

transportation_to_work = models.CharField(
verbose_name="Means of Transportation to Work"
)
work_commute_time = models.FloatField(verbose_name="Time of commute to work")
Contributor

It would help at some point, maybe from @benjaleivas, to clarify the units: is commute time in minutes? Hours?

)
work_commute_time = models.FloatField(verbose_name="Time of commute to work")
vehicles_available = models.IntegerField()
disability_status = models.IntegerField() # Check the type
Contributor

maybe an enum? disabled, not disabled, temporarily disabled?
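
The enum idea can be sketched with the standard-library `enum` module. The member names and integer codes below are assumptions for illustration only (the thread does not settle what `disability_status` actually encodes); Django's `models.IntegerChoices` follows the same enum pattern and could back an `IntegerField` through its `choices` option.

```python
from enum import IntEnum


class DisabilityStatus(IntEnum):
    """Hypothetical integer codes for the statuses suggested in the review."""

    NOT_DISABLED = 0
    DISABLED = 1
    TEMPORARILY_DISABLED = 2


def parse_status(raw: int) -> DisabilityStatus:
    """Convert a raw integer code to an enum member.

    Raises ValueError for codes that are not defined above.
    """
    return DisabilityStatus(raw)
```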

class TransitStation(models.Model):
    station_id = models.CharField(max_length=30, primary_key=True)
    location = models.PointField()
    route = models.CharField(max_length=30)
Contributor

I think multiple routes have the same station_id (if we go along with the Roosevelt chat we had) so route or route_id would also need to be set as a primary key
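
To illustrate the point about `station_id` alone not being unique, here is a small sketch of a composite key on (station_id, route), written against the standard-library `sqlite3` module. The table layout and the `add_station` helper are assumptions for the example. Django 5.0, the version of the docs linked in this thread, has no composite primary key field, so in models.py the usual substitute would be an automatic `id` plus a `UniqueConstraint` on the two fields in `Meta.constraints`.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a station table keyed on the (station_id, route) pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transit_station ("
        " station_id TEXT NOT NULL,"
        " route TEXT NOT NULL,"
        " PRIMARY KEY (station_id, route))"
    )
    return conn


def add_station(conn: sqlite3.Connection, station_id: str, route: str) -> bool:
    """Insert a (station, route) pair; return False if the pair already exists."""
    try:
        conn.execute(
            "INSERT INTO transit_station VALUES (?, ?)", (station_id, route)
        )
        return True
    except sqlite3.IntegrityError:
        # The same station on the same route was already stored.
        return False
```

With this key, Roosevelt can appear once per line it serves, while a repeated (Roosevelt, Red) pair is rejected.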

"""
Class that represent subway lines and bus routes
"""
city = models.CharField(max_length=30)
Contributor

might want to set cities as a choice of ['CHICAGO', 'NYC', 'PORTLAND'] like this https://docs.djangoproject.com/en/5.0/ref/models/fields/#choices
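
A minimal sketch of what the suggested choices could look like. Django's `choices` option takes a sequence of (stored value, human-readable label) pairs; the labels and the `normalize_city` helper below are assumptions for illustration, and the snippet is plain Python so it runs without a Django project.

```python
# (stored value, human-readable label) pairs, the shape Django's `choices` expects.
CITY_CHOICES = [
    ("CHICAGO", "Chicago"),
    ("NYC", "New York City"),
    ("PORTLAND", "Portland"),
]

# In models.py this list would be passed to the field, for example:
#     city = models.CharField(max_length=30, choices=CITY_CHOICES)

VALID_CITIES = {value for value, _label in CITY_CHOICES}


def normalize_city(raw: str) -> str:
    """Return the stored value for a city name, or raise ValueError if unknown."""
    city = raw.strip().upper()
    if city not in VALID_CITIES:
        raise ValueError(f"unknown city: {raw!r}")
    return city
```

Django checks `choices` during model validation (`full_clean()` and forms) rather than on a plain `save()`, so an ingestion script may still want an explicit check along these lines.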

Contributor Author

Added to the model

Class that represents bus stops and station complex
(i.e. CTA - Roosevelt)
"""
city = models.CharField(max_length=30)
Contributor

same choice type comment here as above

"""
station = models.ForeignKey(TransitStation, on_delete=models.PROTECT)
route = models.ForeignKey(TransitRoute, on_delete=models.PROTECT)

Contributor

nice!!

route = models.ForeignKey(TransitRoute, on_delete=models.PROTECT)
date = models.DateField()
ridership = models.IntegerField()

Contributor

For @jamesturk and @divij-sinha we decided to go with these RidershipRoute and RidershipStation tables because we realized that depending on the city the ridership metrics may occur at different levels (i.e. I think Portland may have route level ridership for the subway lines whereas the other cities have them at the station level).

For now it seems like we will have to create one view for Portland ridership and one view for Chicago/NYC ridership, since they will pull from different tables. If you suggest handling this differently (e.g. adding a city column and mode to each table, or an association table), please advise!

Contributor

@JPMartinezClaeys depending on the response to this question, the TransitModes might need to be accessible by Ridership instead of nested within TransitRoute

@JPMartinezClaeys
Contributor Author

@jamesturk @divij-sinha we made some modifications to the model after the conversation we had yesterday.
We ended up modifying the TransitStation model so that one observation represents one "transit station complex" (e.g. CTA-Roosevelt), and we created another table, a relational table, where we map each station to the routes it serves (e.g. CTA-Roosevelt Red, CTA-Roosevelt Orange).
For the ridership, we split it into ridership by station and by route. While we would expect (if we ever expand this project) that most cities would have subway ridership information at the station level, we didn't label the route-level ridership as "Bus data", so that a city with subway information at the line level can be included that way.

@@ -124,7 +130,7 @@

LANGUAGE_CODE = "en-us"

-TIME_ZONE = "UTC"
+TIME_ZONE = "America/Chicago"
Contributor

@meganhmoore May 3, 2024

Just want a gut check on this. I don't think we currently have anything more granular than day-level dates so this shouldn't cause huge problems, but if we were to have any daily schedule data or if we had users submit the timeframe they want to take a particular route we will have to translate to Chicago time.

Usually when there is conversion I have seen everything get converted to UTC so that we know it is all fully standardized, but if you are ok with keeping Chicago then that is fine. We will just have to note that any future hourly data for non-Chicago places will need to be translated.
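
The "convert everything to UTC" approach can be sketched with the standard library. The helper names are assumptions for illustration; the zone names are standard IANA identifiers. Separately from this sketch, when Django's `USE_TZ` setting is enabled, Django stores datetimes in UTC and treats `TIME_ZONE` as the default zone for display and input.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_utc(local_naive: datetime, tz_name: str) -> datetime:
    """Interpret a naive datetime as wall-clock time in tz_name and convert to UTC."""
    return local_naive.replace(tzinfo=ZoneInfo(tz_name)).astimezone(timezone.utc)


def to_local(utc_dt: datetime, tz_name: str) -> datetime:
    """Convert an aware UTC datetime to wall-clock time in tz_name."""
    return utc_dt.astimezone(ZoneInfo(tz_name))


# A Chicago departure time is normalized to UTC for storage, then shown
# in New York time for a user in a different city.
stored = to_utc(datetime(2024, 5, 3, 8, 30), "America/Chicago")
shown = to_local(stored, "America/New_York")
```

Storing UTC and converting at the edges keeps hourly data from different cities comparable, at the cost of an explicit conversion whenever a local time is read or displayed.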

@JPMartinezClaeys JPMartinezClaeys merged commit e8912be into main May 3, 2024
1 check passed
@JPMartinezClaeys JPMartinezClaeys deleted the 27-create-db-tables-aka-django-modelspy branch May 3, 2024 20:44
Development

Successfully merging this pull request may close these issues.

Create DB Tables aka Django models.py
5 participants