
3. Project Architecture

frhino edited this page Jan 13, 2019 · 5 revisions

Getting Started

We develop in Python and Jupyter Notebooks. Our production code lives in the "/main" folder, which is organized as follows:

  • /data : will hold any project-specific files that the code uses as lists or dictionaries. Sub-folders will organize external and manually produced datasets, including any data files we want to keep before a DBMS is implemented.
  • /code : will include Jupyter notebooks that perform tasks such as
    • pulling data from Twitter
    • cleaning the data for pre-processing
    • pre-processing text data, such as removing stop words, stemming/lemmatization, and named-entity recognition (NER)
    • running ML algorithms that classify unlabeled data by topic and subject
    • storing scores, hyperparameter settings, updated classifications
    • outputting results to output folder
  • /pipeline : as the code runs, the corresponding output files will be created here. Please refer to this article for an explanation of the organization.
  • /output : will contain the final output files from the code, to be used as
    • user research data
    • reports/presentations
    • input to applications for in-app content (though this might require that a DBMS be implemented)

Use the "/sandbox" folder for storing experiments and playing around. Right now, everything in the "/twitter" folder should go in "/sandbox". The "/outreach" folder is for organizing materials for producing presentations.
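The folder layout described above could be set up with a few lines of Python. This is only a sketch: the folder names come from the list above, but the `scaffold_main` helper and its `repo_root` argument are hypothetical and are not part of the repo.

```python
from pathlib import Path

# Sub-folder names taken from the list above; the helper itself is hypothetical.
MAIN_SUBFOLDERS = ("data", "code", "pipeline", "output")


def scaffold_main(repo_root):
    """Create main/ and its four sub-folders under repo_root; return their paths."""
    main = Path(repo_root) / "main"
    created = {}
    for name in MAIN_SUBFOLDERS:
        folder = main / name
        # exist_ok=True makes the call safe to repeat on an existing clone.
        folder.mkdir(parents=True, exist_ok=True)
        created[name] = folder
    return created
```

Calling `scaffold_main(".")` from the repo root would create any missing folders and leave existing ones untouched.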

When the platform is complete, it will be used by cloning the repo, adding project-specific files, and running the code in the "/main/code" folder. A team may use one or more notebooks in the "/main/code" folder to accomplish an end-to-end analysis project.
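To make the end-to-end flow concrete, here is a minimal sketch of what one notebook might do: clean raw text, remove stop words, write intermediate results to "/pipeline", and write a final summary to "/output". Everything named here is an assumption for illustration: the tiny `STOP_WORDS` set, the function names, and the file names `tokens.json` and `token_counts.json` are hypothetical, and a real notebook would likely use a proper NLP library and a stop-word list kept in "/data".

```python
import json
import re
from pathlib import Path

# Tiny illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in"}


def clean_text(text):
    """Lowercase the text and strip URLs, @mentions, and punctuation."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"@\w+", " ", text)
    return re.sub(r"[^a-z0-9\s]", " ", text)


def preprocess(text):
    """Return the cleaned tokens with stop words removed."""
    return [tok for tok in clean_text(text).split() if tok not in STOP_WORDS]


def run_pipeline(texts, main_dir):
    """Write tokens to pipeline/ and token counts to output/ under main_dir."""
    main_dir = Path(main_dir)
    pipeline_dir = main_dir / "pipeline"
    output_dir = main_dir / "output"
    pipeline_dir.mkdir(parents=True, exist_ok=True)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Intermediate result: one token list per input text.
    tokens = [preprocess(t) for t in texts]
    (pipeline_dir / "tokens.json").write_text(json.dumps(tokens))

    # Final result: how often each token appears across all texts.
    counts = {}
    for doc in tokens:
        for tok in doc:
            counts[tok] = counts.get(tok, 0) + 1
    (output_dir / "token_counts.json").write_text(json.dumps(counts))
    return counts
```

Keeping the intermediate tokens in "/pipeline" and only the summary in "/output" mirrors the split between the two folders described earlier.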
