telvis07/10_capstone

Final project for the Data Science Specialization Capstone Course

Links

Files

analysis.R : code to generate ngrams

  • fetch_capstone_data : fetch the Capstone data
  • preprocess_entries : perform text preprocessing and data cleanup
  • get_docterm_matrix : generate the ngram model and compute the Maximum Likelihood Estimate (MLE) for each ngram
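
As an illustration of what a Maximum Likelihood Estimate for an ngram usually means, here is a minimal sketch. It is written in Python for brevity and is not the repository's R code. It assumes the MLE is the conditional relative frequency count(ngram) / count(prefix); the names `ngrams` and `mle_table` and the toy sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mle_table(sentences, n):
    """Map each ngram to count(ngram) / count(prefix).

    Assumption: tokens are produced by lowercasing and splitting on
    whitespace; the real preprocessing may differ.
    """
    ngram_counts = Counter()
    prefix_counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for gram in ngrams(tokens, n):
            ngram_counts[gram] += 1
            prefix_counts[gram[:-1]] += 1  # the first n-1 words are the context
    return {gram: count / prefix_counts[gram[:-1]]
            for gram, count in ngram_counts.items()}


# Toy corpus (made up for this example).
table = mle_table(["i love data science", "i love r", "i like data"], 2)
print(table[("i", "love")])  # "i" is followed by "love" in 2 of its 3 occurrences
```
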

search_with_dataframes.R : code to build ngram models and perform search using the models

  • ngram_language_modeling_with_data_frames : train models on 2-grams, 3-grams and 4-grams on sampled data
  • multi_search_tree_with_data_frames : predict function that estimates the next word for an input. Uses stupid backoff: it queries the 4-gram model first and falls back to the 3-gram or 2-gram model if a higher-order model yields no results.
  • predict_test_data : measure the model's prediction accuracy on test data
  • generate_queries_and_answers, generate_queries_and_answers_from_csv, generate_quiz_1_data, generate_quiz_2_data : methods to generate test data for predict_test_data
  • build_ngram_4_partition : experimental code to build a model with 100% of the data
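
The backoff lookup described above can be sketched as follows. This is a Python illustration under stated assumptions, not the repository's R implementation: the `models` layout (ngram order mapped to a prefix lookup table), the function name `predict_next`, and the scores are all hypothetical. It shows only the fall-through from higher to lower orders and omits the fixed discount factor that the published Stupid Backoff scheme applies at each backoff step.

```python
def predict_next(models, text, top_k=3):
    """Return up to top_k candidate next words, backing off to lower orders.

    models maps n -> {prefix_tuple: {next_word: score}} (hypothetical layout).
    """
    tokens = text.lower().split()
    for n in sorted(models, reverse=True):  # try the highest order first
        if len(tokens) < n - 1:
            continue  # not enough context for this order
        prefix = tuple(tokens[len(tokens) - (n - 1):])
        candidates = models[n].get(prefix)
        if candidates:
            # Highest score first; ties broken alphabetically.
            ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
            return [word for word, _ in ranked[:top_k]]
    return []  # no model had a match


# Toy models (made-up scores).
models = {
    4: {("thanks", "for", "the"): {"follow": 0.6, "help": 0.4}},
    3: {("for", "the"): {"first": 0.5, "follow": 0.3}},
    2: {("the",): {"same": 0.2, "first": 0.1}},
}

print(predict_next(models, "Thanks for the"))  # answered by the 4-gram model
print(predict_next(models, "ready for the"))   # no 4-gram match, uses the 3-gram model
```
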

grid_search.R

  • grid_search : attempts to find the optimal value of "ngram coverage", the setting used to prune the ngram models, using a grid search.
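
A minimal sketch of a grid search over a coverage threshold, in Python rather than the repository's R. It assumes that "ngram coverage" means keeping the most frequent ngrams until they account for a given fraction of all ngram occurrences; the README does not define the term, so that reading is an assumption. The helpers `prune_by_coverage`, `grid_search` and `toy_score`, and all the counts, are hypothetical.

```python
def prune_by_coverage(counts, coverage):
    """Keep the most frequent ngrams until they cover `coverage` of all occurrences."""
    total = sum(counts.values())
    kept, running = {}, 0
    for gram, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if running >= coverage * total:
            break
        kept[gram] = count
        running += count
    return kept


def grid_search(counts, coverages, score_fn):
    """Score the pruned model at each coverage value and return the best pair."""
    results = [(cov, score_fn(prune_by_coverage(counts, cov))) for cov in coverages]
    best = max(results, key=lambda r: r[1])
    return best, results


# Toy data (made up): ngram counts and a held-out set of ngrams to predict.
counts = {("of", "the"): 50, ("in", "the"): 30, ("to", "be"): 15, ("data", "science"): 5}
held_out = [("of", "the"), ("in", "the"), ("to", "be")]


def toy_score(model):
    # Hypothetical objective: held-out hit rate minus a small model-size penalty.
    hits = sum(1 for gram in held_out if gram in model)
    return hits / len(held_out) - 0.05 * len(model)


best, results = grid_search(counts, [0.5, 0.8, 0.9, 1.0], toy_score)
print(best[0])  # coverage value with the highest toy score
```

In practice the scoring function would be something like prediction accuracy on held-out queries, possibly traded off against model size.
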

sample_data.R

  • sample_capstone_data, generate_sample_files : methods to generate samples of raw training data.

See steps.md for steps to build the ngram models.
