This project contains two programs: one pre-processes the input data, and the other builds a B-Tree from a .txt file so that a name can be searched. Both are console applications and implement the functions requested by the professor. As the professor said in class, a B-Tree is a good way to handle data that goes beyond RAM/main memory.

The pre-processing program is in the Pre-process folder, which contains two required folders: “input” and “processed”. Its getSubFile() function scans every .txt file in the “input” folder line by line (the files are named 1.txt, 2.txt, 3.txt and so on) and writes each line into one of the .txt files in the “processed” folder. Letters in English occur with known relative frequencies, so for each line I take the first letter of the person’s given name (for example, the first letter of “Shuying Zhang” is “S”) and write the line into the matching .txt file (Shuying Zhang goes into S.txt). The names of the processed .txt files are in the image below.

Figure 2 shows that the letters from C to Z in the chart are not commonly used in English, and I assume the same holds for English names. The bar chart also shows that E is the most-used letter, but the combined count of C, U and M is almost equal to the count of E. So I created a single .txt file for all names whose first letter is C, U or M and named it C_U_M.txt. The {W, G, F}, {Y, P, B} and {V, K, J, Q, X, Z} files were grouped for the same reason.

BTree_DS_Pro_2.java is run after pre-processing. It has a function that gets the first letter of the name typed into the console. From that letter, the program knows which .txt file to search.
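The mapping from a first letter to a processed file can be sketched as below. This is a minimal illustration, not the code in this repository: the class `BucketFiles` and the method `fileFor` are hypothetical names, and only C_U_M.txt and W_F_G.txt are file names that appear in this README. The names of the other two grouped files are assumed to follow the same pattern.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not the repository's code): maps a name to the
// processed file that should hold it, based on the first letter.
final class BucketFiles {
    // Letter groups that share one file. Only C_U_M.txt and W_F_G.txt are
    // named in the README; the other two file names are assumptions.
    private static final String[] GROUPS = {"C_U_M", "W_F_G", "Y_P_B", "V_K_J_Q_X_Z"};
    private static final Map<Character, String> FILE_BY_LETTER = new HashMap<>();

    static {
        // Register every letter of every group under the group's file name.
        for (String group : GROUPS) {
            for (String letter : group.split("_")) {
                FILE_BY_LETTER.put(letter.charAt(0), group + ".txt");
            }
        }
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    // Returns the file name for a name such as "Shuying Zhang".
    // Letters that are not in a group get their own file, for example S.txt.
    static String fileFor(String name) {
        String trimmed = name.trim();
        if (trimmed.isEmpty() || !isAsciiLetter(trimmed.charAt(0))) {
            throw new IllegalArgumentException("Name must start with a letter A-Z: " + name);
        }
        char first = Character.toUpperCase(trimmed.charAt(0));
        return FILE_BY_LETTER.getOrDefault(first, first + ".txt");
    }

    public static void main(String[] args) {
        System.out.println(fileFor("Shuying Zhang")); // S.txt
        System.out.println(fileFor("Maria Lopez"));   // hypothetical name, C_U_M.txt
    }
}
```

Both the pre-processing step and the search step could call the same function, so a name is always looked up in the file it was written to.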
For example, when I type Shuying Zhang into the console, the program knows that the information for Shuying Zhang is in S.txt in the “processed” folder. It then builds a B-Tree from that file for searching. The get() function looks up the value for the typed-in name.

Considering millions of names: the program can still work if there are 200 given files instead of 20, after a small change. The idea is to take the first two letters of each person’s name (“Sh” for “Shuying Zhang”, or the first three, “Shu”) and classify the names into different .txt files. For example, Shuying Zhang is in SH.txt, Amrinder Arora is in AM.txt, and Iswarya Parupudi is in IS.txt. With two letters there are 26*26 = 676 files. If there are millions of names and they are spread evenly across the files, only thousands of names are in each file. If that is too many files, merge the ones that hold little information (such as ZZ.txt, QQ.txt, XX.txt and so on). If one file holds too many items (such as EE.txt), split it into two or three. In the same way, as in Figure 1 above, if a file is very big (such as W_F_G.txt or H.txt), split it into two or three .txt files (W.txt, F.txt and G.txt). If some files are very small, merge them into one file (for example, merge I.txt, O.txt and N.txt into I_O_N.txt).

How to run my programs: Project 2 contains two programs, Pre-process.java in the Pre-process folder and BTree_DS_Pro_2.java. Run Pre-process.java first and then BTree_DS_Pro_2.java. Both were written in the NetBeans IDE. If they cannot be run in Eclipse, please run them in NetBeans.
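Going back to the two-letter idea under “Considering millions of names”, the classification step could look roughly like the sketch below. It is an illustration under assumptions, not part of the submitted programs: `PrefixBuckets`, `fileFor`, `group` and the fallback file `OTHER.txt` (for names that do not start with two letters) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of two-letter bucketing (up to 26*26 = 676 files).
final class PrefixBuckets {
    // Assumed fallback for names that do not start with two ASCII letters.
    static final String FALLBACK_FILE = "OTHER.txt";

    private static boolean isAsciiLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    // "Shuying Zhang" -> "SH.txt", "Amrinder Arora" -> "AM.txt".
    static String fileFor(String name) {
        String trimmed = name.trim();
        if (trimmed.length() < 2
                || !isAsciiLetter(trimmed.charAt(0))
                || !isAsciiLetter(trimmed.charAt(1))) {
            return FALLBACK_FILE;
        }
        return trimmed.substring(0, 2).toUpperCase(Locale.ROOT) + ".txt";
    }

    // Groups names by target file. The bucket sizes show which files are
    // candidates for merging (very small) or splitting (very large).
    static Map<String, List<String>> group(List<String> names) {
        Map<String, List<String>> buckets = new TreeMap<>();
        for (String name : names) {
            buckets.computeIfAbsent(fileFor(name), k -> new ArrayList<>()).add(name);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Shuying Zhang", "Amrinder Arora", "Iswarya Parupudi");
        for (Map.Entry<String, List<String>> e : group(names).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().size() + " name(s)");
        }
    }
}
```

Counting names per bucket first, before writing any files, is one way to decide which buckets to merge or split.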